web_page_classification_note
2016-01-28 13:22:33
Web page classification papers: reading and comparison
Web Page Classification
INTRODUCTION
Problem Definition
Binary classification
Multiclass, single-label (one result per page)
Multiclass, multi-label
Multiclass, multi-label, with different weights per label
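The four problem settings above differ only in how a classifier's per-category scores are turned into an answer. A minimal sketch, assuming hypothetical category names and a hypothetical 0.5 decision threshold (neither comes from the notes):

```python
# Hypothetical category names, chosen only for illustration.
CATEGORIES = ["arts", "business", "sports"]


def binary_label(scores, category, threshold=0.5):
    """Binary classification: is the page in `category` or not?"""
    return scores.get(category, 0.0) >= threshold


def single_label(scores):
    """Multiclass, single-label: exactly one category per page."""
    return max(CATEGORIES, key=lambda c: scores.get(c, 0.0))


def multi_label(scores, threshold=0.5):
    """Multiclass, multi-label: every category whose score clears the threshold."""
    return [c for c in CATEGORIES if scores.get(c, 0.0) >= threshold]


def weighted_labels(scores):
    """Multi-label with weights: normalize the scores so they sum to 1."""
    total = sum(scores.get(c, 0.0) for c in CATEGORIES)
    if total == 0:
        return {c: 0.0 for c in CATEGORIES}
    return {c: scores.get(c, 0.0) / total for c in CATEGORIES}


# Made-up classifier scores for one page.
scores = {"arts": 0.6, "business": 0.2, "sports": 0.7}
is_arts = binary_label(scores, "arts")
best = single_label(scores)
labels = multi_label(scores)
weights = weighted_labels(scores)
```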
Applications of Web Classification
Building and extending Web directories
Single-level and multi-level: flat classification and hierarchical classification
As defined by https://www.dmoz.org/
Improving the quality of search results
Assisting question answering systems
Pages are classified with decision tree classifiers into:
collection pages (containing a list of items)
topic pages (representing an answer instance)
relevant pages (supporting an answer instance)
irrelevant pages.
Building Efficient Focused Crawlers or Vertical (Domain-Specific) Search Engines
Other Applications
Web content filtering
contextual advertising
ontology annotation
knowledge base construction
The Difference Between Web Classification and Text Classification
Traditional text classification is typically performed on structured documents
Web pages are semistructured documents in HTML
Web pages are connected to one another by hyperlinks, a feature central to the definition of the Web
Related Surveys
FEATURES
Using On-Page Features
Feature selection to make better use of the textual features
n-gram representation: short sequences of text serve as the units that make up the vector
Good-quality document summarization can accurately represent the major topic of a Web page.
Visual Analysis.
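The n-gram representation listed among the on-page features can be sketched as follows. This is a minimal illustration, assuming word-level bigrams and a small hand-picked vocabulary; the function names and the sample text are hypothetical and not taken from the notes.

```python
import re
from collections import Counter


def word_ngrams(text, n=2):
    """Split text into lowercase word tokens and return its word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_vector(text, vocabulary, n=2):
    """Count how often each vocabulary n-gram occurs in the text."""
    counts = Counter(word_ngrams(text, n))
    return [counts[gram] for gram in vocabulary]


# Hypothetical page text and vocabulary.
page_text = "New York hotel deals in New York"
vocabulary = ["new york", "hotel deals", "san francisco"]
vector = ngram_vector(page_text, vocabulary)
```

Each vector position corresponds to one n-gram, so multi-word phrases that single-word features would split apart stay together as one unit.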
Using Features of Neighbors
Motivation.
Underlying Assumptions
If Pa and Pb both belong to a category, their visually adjacent neighbor nodes belong to that category as well
Neighbor Selection
The parent page, or the anchor text of (or content near) the hyperlinks pointing to the target page, is the more important source
Features of Neighbors
Utilizing Artificial Links.
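Collecting the anchor text of hyperlinks that point to the target page, as noted under Neighbor Selection, might look like the sketch below. It uses Python's standard `html.parser`; the class name, helper name, and example URLs are hypothetical, and the sketch matches `href` values exactly without resolving relative links.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect the anchor text of links whose href equals the target URL."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.anchor_texts = []
        self._inside_target_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self._inside_target_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside_target_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside_target_link:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._buffer).split())
            self.anchor_texts.append(text)
            self._inside_target_link = False


def anchor_texts_for(parent_html, target_url):
    """Return anchor texts in a parent page that point at target_url."""
    collector = AnchorTextCollector(target_url)
    collector.feed(parent_html)
    collector.close()
    return collector.anchor_texts


# Hypothetical parent page with one link to the target and one elsewhere.
parent_html = (
    '<p>See <a href="http://example.com/target">cheap   flights</a> and '
    '<a href="http://example.com/other">other</a>.</p>'
)
texts = anchor_texts_for(parent_html, "http://example.com/target")
```

The collected strings can then be added to the target page's own text features, or kept as a separate feature set.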
Feature selection
Summary of news articles: using only the first fragment of each document offers fast and accurate classification of news articles
Latent semantic indexing
A matrix factorization
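Latent semantic indexing factorizes the term-document matrix and keeps only the strongest latent dimensions. The sketch below recovers just the single leading dimension by power iteration in pure Python, a deliberate simplification: a practical system would compute a truncated SVD with several dimensions using a numerical library. The function name and the toy matrix are hypothetical.

```python
import math


def top_latent_dimension(term_doc, iterations=100):
    """Return (sigma, v): the leading singular value of term_doc and its
    right singular vector (one entry per document), via power iteration.

    term_doc is a list of rows, one row per term, one column per document.
    """
    n_terms = len(term_doc)
    n_docs = len(term_doc[0])
    v = [1.0] * n_docs
    for _ in range(iterations):
        # u = A v: one entry per term.
        u = [sum(row[j] * v[j] for j in range(n_docs)) for row in term_doc]
        # w = A^T u: one entry per document.
        w = [sum(term_doc[i][j] * u[i] for i in range(n_terms))
             for j in range(n_docs)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_docs)) for row in term_doc]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, v


# Toy matrix: documents 0 and 1 share two terms, document 2 uses a third.
term_doc = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
sigma, v = top_latent_dimension(term_doc)
# Coordinate of each document along the leading latent dimension.
doc_coords = [sigma * x for x in v]
```

Documents 0 and 1 receive the same coordinate, while document 2 falls near zero on this dimension, which is the grouping effect LSI relies on.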
Discussion: Features
ALGORITHMS
Dimension Reduction
Feature selection reduces the dimensionality of the feature space
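One simple way to shrink the feature space is to drop terms that appear in too few documents. The sketch below uses document-frequency thresholding as a stand-in; the notes do not say which selection metric is meant, and the function name and the `min_df` and `max_features` parameters are hypothetical.

```python
from collections import Counter


def select_by_document_frequency(documents, min_df=2, max_features=None):
    """Keep terms that occur in at least min_df documents.

    documents is a list of token lists. Terms are returned in order of
    decreasing document frequency, with ties broken alphabetically.
    """
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    kept = [term for term, count in doc_freq.items() if count >= min_df]
    kept.sort(key=lambda term: (-doc_freq[term], term))
    if max_features is not None:
        kept = kept[:max_features]
    return kept


# Hypothetical tokenized pages.
documents = [
    ["web", "page", "class"],
    ["web", "link"],
    ["page", "web", "web"],
]
selected = select_by_document_frequency(documents)
```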
Relational Learning
Because Web pages are connected by hyperlinks, classifying them is a relational learning problem
Relaxation labeling
Loopy belief propagation and iterative classification
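The idea behind iterative classification can be sketched on a small link graph: start from an initial guess for each unlabeled page, then repeatedly re-label each page from its neighbors' current labels until nothing changes. This sketch uses a plain majority vote of neighbor labels, which is a simplification; published methods typically combine neighbor labels with a content-based classifier. All names and the example graph are hypothetical.

```python
from collections import Counter


def iterative_classification(links, seed_labels, initial_guess, max_rounds=10):
    """Re-label unlabeled pages by majority vote of their neighbors.

    links maps each page to the pages it is linked with; seed_labels holds
    the known labels, which are never changed.
    """
    labels = dict(initial_guess)
    labels.update(seed_labels)
    for _ in range(max_rounds):
        changed = False
        for page in sorted(labels):
            if page in seed_labels:
                continue
            votes = Counter(labels[n] for n in links.get(page, ()) if n in labels)
            if not votes:
                continue
            # Highest vote count wins; ties are broken alphabetically.
            best = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
            if best != labels[page]:
                labels[page] = best
                changed = True
        if not changed:
            break
    return labels


# Hypothetical graph: pages a and b are labeled, c and d are not.
links = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
seed_labels = {"a": "sports", "b": "sports"}
initial_guess = {"c": "news", "d": "news"}
result = iterative_classification(links, seed_labels, initial_guess)
```

Page c flips to the label of its two labeled neighbors, and page d then follows c, which shows how label information propagates along links.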
Modifications to Traditional Algorithms
k-Nearest Neighbor classifiers
SVM classifier: can then be trained on the labeled positive examples and the filtered negative examples
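For reference, the unmodified k-nearest-neighbor baseline that such modifications start from can be sketched as below, assuming pages are represented as sparse term-count dictionaries compared by cosine similarity. The function names and the toy training data are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count dictionaries."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, labeled_pages, k=3):
    """Return the majority label among the k pages most similar to query."""
    ranked = sorted(labeled_pages,
                    key=lambda item: cosine(query, item[0]),
                    reverse=True)[:k]
    votes = Counter(label for _, label in ranked)
    return votes.most_common(1)[0][0]


# Hypothetical labeled pages as (term counts, label) pairs.
labeled_pages = [
    ({"goal": 2, "team": 1}, "sports"),
    ({"team": 2, "match": 1}, "sports"),
    ({"stock": 2, "market": 1}, "finance"),
    ({"market": 2, "bank": 1}, "finance"),
]
predicted = knn_classify({"team": 1, "goal": 1}, labeled_pages, k=3)
```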
Hierarchical Classification
Hierarchical SVMs: mediocre results
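Top-down hierarchical classification walks the category tree from the root, choosing one child at each level. In the sketch below, a keyword-overlap score stands in for the per-node classifier (which could be an SVM, for example); the taxonomy, keywords, and function names are all hypothetical.

```python
def overlap_score(tokens, keywords):
    """Stand-in for a per-node classifier: count tokens found in keywords."""
    return sum(1 for token in tokens if token in keywords)


def classify_top_down(tokens, node):
    """Descend the category tree, picking the best-scoring child each level."""
    path = [node["name"]]
    while node.get("children"):
        best = max(node["children"],
                   key=lambda child: overlap_score(tokens, child["keywords"]))
        if overlap_score(tokens, best["keywords"]) == 0:
            break  # no child matches, so stop at the current level
        path.append(best["name"])
        node = best
    return path


# Hypothetical two-level taxonomy.
taxonomy = {
    "name": "Top",
    "children": [
        {"name": "Sports", "keywords": {"team", "goal", "match"},
         "children": []},
        {"name": "Computers",
         "keywords": {"python", "code", "compiler", "cpu", "gpu"},
         "children": [
             {"name": "Programming",
              "keywords": {"python", "code", "compiler"}, "children": []},
             {"name": "Hardware", "keywords": {"cpu", "gpu"}, "children": []},
         ]},
    ],
}
path = classify_top_down(["python", "code", "compiler"], taxonomy)
```

A wrong choice near the root cannot be corrected further down the tree, which is one commonly cited weakness of the top-down approach.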
Combining Information from Multiple Sources
voting and stacking
Combining SVM kernels
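Voting is the simplest of the combination schemes listed above: each source-specific classifier predicts a label, and the most frequent label wins. A minimal sketch, assuming three hypothetical classifiers (page content, anchor text, URL) whose outputs are already available; the alphabetical tie-breaking rule is also an assumption.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties are broken alphabetically."""
    counts = Counter(predictions)
    top = max(counts.values())
    return sorted(label for label, count in counts.items() if count == top)[0]


# Hypothetical predictions from classifiers trained on different sources.
predictions = {
    "content": "sports",
    "anchor_text": "news",
    "url": "sports",
}
combined = majority_vote(list(predictions.values()))
```

Stacking goes one step further by training a second-level classifier on the base classifiers' outputs rather than counting votes.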
OTHER ISSUES
Web Page Content Preprocessing
Dataset Selection and Generation
supervised learning problem
Web Site Classification
Blog Classification
1. A binary classification of blog and nonblog
2. The genre of blogs
CONCLUSION