web_page_classification_note
2016-01-28 13:22:33
Web page classification papers: reading notes and comparison
INTRODUCTION
Problem Definition
Binary classification
Multiclass, single-label (j outcomes)
Multiclass, multi-label
Multiclass, multi-label, with different weights per label
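The four problem types above differ mainly in the shape of the output. A minimal sketch, assuming a classifier that already returns one score per category (the `scores` dictionary, the category names, and the 0.5 threshold are all hypothetical):

```python
def binary(scores, category, threshold=0.5):
    """Binary: does the page belong to one given category?"""
    return scores[category] >= threshold


def single_label(scores):
    """Multiclass, single-label: pick exactly one category."""
    return max(scores, key=scores.get)


def multi_label(scores, threshold=0.5):
    """Multiclass, multi-label: keep every category at or above the threshold."""
    return sorted(c for c, s in scores.items() if s >= threshold)


def weighted_multi_label(scores):
    """Multi-label with weights: normalise the scores so they sum to 1."""
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical classifier scores for one page.
scores = {"sports": 0.6, "news": 0.3, "arts": 0.1}
print(binary(scores, "sports"))
print(single_label(scores))
print(multi_label(scores, threshold=0.25))
print(weighted_multi_label(scores))
```

The same scores feed all four variants; only the decision rule changes.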
Applications of Web Classification
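Constructing and expanding Web directories
Single-level versus multi-level: flat classification and hierarchical classification
Definition used by https://www.dmoz.org/
Improving the quality of search results
Helping question answering systems
Pages are classified with decision tree classifiers into: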
collection pages (containing a list of items)
topic pages (representing an answer instance)
relevant pages (supporting an answer instance)
irrelevant pages.
Building Efficient Focused Crawlers or Vertical (Domain-Specific) Search Engines
Other Applications
Web content filtering
contextual advertising
ontology annotation
knowledge base construction
The Difference Between Web Classification and Text Classification
Traditional text classification is typically performed on structured documents
Web pages are semistructured documents in HTML
Hyperlinks between documents are a feature central to the definition of the Web
Related Surveys
FEATURES
Using On-Page Features
Feature selection to make better use of the textual features
n-gram representation: use short word sequences as the units that make up the feature vector
HTML tags: title, headings, metadata, and main text
Good-quality document summarization can accurately represent the major topic of a Web page.
Visual Analysis.
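The n-gram and HTML-tag ideas above can be combined. A minimal sketch using only the standard library: text is split by field (title, headings, metadata, main text), turned into word bigrams, and counted with a per-field weight. The weights in `FIELD_WEIGHTS` and the names `FieldExtractor` and `page_features` are assumptions made for illustration, not values from the surveyed papers.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: text in prominent tags counts more than body text.
FIELD_WEIGHTS = {"title": 3, "headings": 2, "metadata": 2, "main": 1}


class FieldExtractor(HTMLParser):
    """Collects page text separately for title, headings, metadata, main text."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "headings": [], "metadata": [], "main": []}
        self._field = "main"  # field that incoming text is assigned to

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if attr.get("name") in ("description", "keywords") and attr.get("content"):
                self.fields["metadata"].append(attr["content"])
        elif tag == "title":
            self._field = "title"
        elif tag in self.HEADINGS:
            self._field = "headings"
        elif tag in ("script", "style"):
            self._field = None  # ignore non-visible text

    def handle_endtag(self, tag):
        if tag == "title" or tag in self.HEADINGS or tag in ("script", "style"):
            self._field = "main"

    def handle_data(self, data):
        text = data.strip()
        if text and self._field is not None:
            self.fields[self._field].append(text)


def word_ngrams(text, n=2):
    """Return the word n-grams of a text as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def page_features(html, n=2):
    """Weighted n-gram counts for one page, one weight per HTML field."""
    parser = FieldExtractor()
    parser.feed(html)
    features = Counter()
    for field, chunks in parser.fields.items():
        for chunk in chunks:
            for gram in word_ngrams(chunk, n):
                features[gram] += FIELD_WEIGHTS[field]
    return features
```

Calling `page_features(html)` on a page's HTML string gives a sparse vector that any standard text classifier could consume.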
Using Features of Neighbors
Motivation.
When a page's own features are missing or misleading, its neighboring (related) pages can be used to classify it
Underlying Assumptions
If Pa and Pb both belong to a category, their visually adjacent neighbor nodes are assumed to belong to that category too
Neighbor Selection
The parent page, or the anchor text and nearby content of hyperlinks pointing to the target page, carries more weight
Features of Neighbors
Parent, child, sibling, and spouse pages are all useful, but sibling pages work best
(the same few keep coming up) The features that have been used from neighbors include labels, partial content (anchor text, the surrounding text of anchor text, titles, headers), and full content
Utilizing Artificial Links.
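One way to picture how neighbor features are used: the target page's own tokens are merged with anchor text from incoming links and with the labels of already-classified neighbors. This is a minimal sketch; the feature prefixes, the `anchor_weight` parameter, and the function name are hypothetical choices, not taken from a specific paper.

```python
from collections import Counter


def tokens(text):
    """Lower-case whitespace tokenisation (deliberately simplistic)."""
    return text.lower().split()


def features_with_neighbors(page_text, anchor_texts, neighbor_labels, anchor_weight=2):
    """Merge on-page tokens with partial neighbor content and neighbor labels."""
    features = Counter(tokens(page_text))
    # Anchor text of links pointing at the target page (partial neighbor content).
    for anchor in anchor_texts:
        for tok in tokens(anchor):
            features["anchor:" + tok] += anchor_weight
    # Labels of neighboring pages that are already classified.
    for label in neighbor_labels:
        features["neighbor_label:" + label] += 1
    return features


example = features_with_neighbors(
    "welcome home",
    ["python tutorial", "python docs"],
    ["programming", "programming", "news"],
)
print(example.most_common(3))
```

Keeping neighbor-derived features under separate prefixes lets a classifier learn different weights for on-page and off-page evidence.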
Feature selection
Summaries of news articles: using only the first fragment of each document offers fast and accurate classification of news articles
Information gain, mutual information, document frequency, and the χ2 test
Latent semantic indexing
Matrix factorization
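As an illustration of one of the selection criteria listed above, here is a minimal sketch of the χ2 test for a term and a category, using the standard 2×2 contingency form χ2 = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)). Documents are assumed to be sets of tokens; the function names and the toy data are hypothetical.

```python
def chi_square(docs, labels, term, category):
    """Chi-square statistic between a term and a category.

    A: in category, has term      B: other category, has term
    C: in category, lacks term    D: other category, lacks term
    """
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


def select_terms(docs, labels, category, k):
    """Keep the k terms with the highest chi-square score (ties broken by name)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary, key=lambda t: (-chi_square(docs, labels, t, category), t))
    return ranked[:k]


# Toy corpus: each document is a set of tokens.
docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "market"}, {"market", "team"}]
labels = ["sports", "sports", "finance", "finance"]
print(select_terms(docs, labels, "sports", 2))
```

Note that χ2 is symmetric: a term that appears only outside the category scores as high as one that appears only inside it.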
Discussion: Features
ALGORITHMS
Dimension Reduction
Feature selection reduces the dimensionality of the feature space
Summaries of news articles: using only the first fragment of each document offers fast and accurate classification of news articles
Information gain, mutual information, document frequency, and the χ2 test
Latent semantic indexing
Matrix factorization
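To make the dimension-reduction step concrete, here is a minimal sketch of information gain for a single term: the entropy of the class labels minus the expected entropy after splitting the documents on the term's presence. Low-scoring terms would be dropped from the feature space. Documents are assumed to be sets of tokens, and the toy corpus is hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Reduction in label entropy from knowing whether `term` occurs."""
    with_term = [label for doc, label in zip(docs, labels) if term in doc]
    without_term = [label for doc, label in zip(docs, labels) if term not in doc]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


# Toy corpus: each document is a set of tokens.
docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "market"}, {"market", "team"}]
labels = ["sports", "sports", "finance", "finance"]
for term in ("goal", "team"):
    print(term, information_gain(docs, labels, term))
```

A term such as "goal" that splits the classes perfectly gets the maximum gain, while "team", which occurs equally in both classes, gets none.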
Relational Learning
Because Web pages are connected by hyperlinks, classification becomes a relational learning problem
Relaxation labeling
Loopy belief propagation and iterative classification
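The iterative idea can be sketched in a heavily simplified form: pages with known labels stay fixed, and every other page repeatedly takes the majority label of its linked neighbors until nothing changes. Real iterative classification normally combines a content-based classifier with the neighbor labels; this sketch uses neighbor labels only, and the graph, seed labels, and function name are hypothetical.

```python
from collections import Counter


def iterative_classification(links, seed_labels, max_iters=10):
    """Propagate labels over a link graph by repeated neighbor majority vote.

    links: dict mapping each page to the list of pages it is linked with.
    seed_labels: dict of pages whose labels are known and kept fixed.
    """
    labels = dict(seed_labels)
    for _ in range(max_iters):
        changed = False
        for page in sorted(links):
            if page in seed_labels:
                continue
            votes = Counter(labels[n] for n in links[page] if n in labels)
            if not votes:
                continue  # no labelled neighbor yet
            # Most frequent neighbor label; ties broken alphabetically.
            best = min(votes, key=lambda label: (-votes[label], label))
            if labels.get(page) != best:
                labels[page] = best
                changed = True
        if not changed:
            break
    return labels


links = {"a": ["c"], "b": ["d"], "c": ["a", "e"], "d": ["b"], "e": ["c"]}
seeds = {"a": "sports", "b": "news"}
print(iterative_classification(links, seeds))
```

The loop stops as soon as a full pass changes no label, or after `max_iters` passes if the labels keep oscillating.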
Modifications to Traditional Algorithms
k-Nearest Neighbor classifiers
Binary classification scenario
SVM classifier: can then be trained on the labeled positive examples and the filtered negative examples
Hierarchical Classification
Hierarchical SVMs give only mediocre results
Combining Information from Multiple Sources
voting and stacking
Combining SVM kernels
Moreover, the combination of two does not always perform better than each separately
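Voting is the simpler of the two combination schemes named above. A minimal sketch, assuming each source (for example page content, anchor text, URL) already has its own classifier that outputs one label per page; the weights and function names are hypothetical, and stacking, which would train a second-level classifier on these outputs, is not shown.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted most often; ties broken alphabetically."""
    counts = Counter(predictions)
    return min(counts, key=lambda label: (-counts[label], label))


def weighted_vote(predictions, weights):
    """Like majority_vote, but each classifier's vote counts by its weight."""
    totals = Counter()
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return min(totals, key=lambda label: (-totals[label], label))


# Hypothetical outputs of three classifiers for the same page.
predictions = ["sports", "news", "sports"]
print(majority_vote(predictions))
print(weighted_vote(predictions, [0.2, 0.9, 0.3]))
```

The two calls can disagree: one strongly weighted classifier can outvote two weak ones, which is one reason a combination does not automatically beat its best member.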
OTHER ISSUES
Web Page Content Preprocessing
Dataset Selection and Generation
supervised learning problem
Web Site Classification
Blog Classification
1. Binary classification of blog versus nonblog: is the page a blog?
2. Identification of the topic, mood, or sentiment of blogs (judged from the mood of the words used)
3. The genre of blogs
CONCLUSION
Supervised learning problem on the basis of subject, function, sentiment, genre, and more