Knowledge Technology
COMP90049 Knowledge Technologies University of Melbourne IT 2018 S1
String Search
Exact String Search
Pattern Matching: RegEx
Naive Matching
Boyer-Moore matching
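A minimal sketch of naive (brute-force) matching; the function name and example are illustrative, not from the subject. Boyer-Moore finds the same occurrences but skips ahead using its bad-character/good-suffix heuristics instead of trying every alignment.

def naive_search(text, pattern):
    """Return all start positions where pattern occurs exactly in text."""
    positions = []
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):           # try every alignment of pattern against text
        if text[i:i + m] == pattern:     # compare character by character
            positions.append(i)
    return positions

# e.g. naive_search("abracadabra", "abra") -> [0, 7]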
Approximate String Search
Neighbourhood Search
Edit Distance (see the sketch after this list)
Global
Local
N-Gram Distance
Phonetics (Soundex)
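A minimal sketch of global (Levenshtein) edit distance via dynamic programming, assuming unit costs for insertion, deletion and replacement; local distance and N-gram distance reuse the same tabular idea with different scoring.

def edit_distance(s, t):
    """Global (Levenshtein) edit distance with unit costs."""
    m, n = len(s), len(t)
    # dp[i][j] = cost of transforming s[:i] into t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                # i deletions
    for j in range(n + 1):
        dp[0][j] = j                                # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + sub)  # match / replace
    return dp[m][n]

# e.g. edit_distance("crat", "cart") -> 2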
Evaluation
Accuracy: fraction of correct responses
Precision: fraction of correct responses among attempted responses
Recall: proportion of query words for which a correct response appears somewhere among the returned candidates
Information retrieval
Information Seeking
Information needs
Informational: global warming
Factoid: melting point of lead
Topic tracking: Trump administration
Navigational: university of Melbourne
Transactional: Macbook Air
Geospatial: carlton restaurants
Answers
relevant
Document Matching
Boolean Query
documents match if they contain all of the query terms and none of the NOT terms
Pros
repeatable, auditable and controllable.
allow expression of complex concepts.
reduction in time spent reading
Cons
no ranking and no control over result set size
difficult to incorporate useful heuristics
Ranked Retrieval
Evidence of similarity
query terms in the title
query terms as phrases
recently created
translate between languages
authoritative, reliable documents
TF-IDF Model
More weight is given to a document where a query term appears many times.
(Term frequency or TF.)
Less weight is given to a term that appears in many documents.
(Inverse document frequency or IDF.)
Less weight is given to a document that has many terms.
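One common instantiation of these three factors (an assumption; the model admits several variants), in LaTeX notation:

w_{d,t} = \frac{f_{d,t}}{|d|} \cdot \log\frac{N}{f_t}

where f_{d,t} is the frequency of term t in document d, |d| the number of terms in d, N the number of documents in the collection, and f_t the number of documents containing t.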
Evaluation
Precision: number of returned relevant results/number of returned results
Recall: number of returned relevant results/total number of relevant results
(often useless in an IR context)
Precision at k (P@k):
(number of returned relevant results in the top k) / k
(Recall at k usually not meaningful)
Web Search
Crawling
(gather data from web)
Basic challenge: there is no central index of URLs of interest.
Secondary challenges:
Some websites return the same content as a new URL at each visit.
Some pages never return status 'done' on access.
Some websites are not intended to be crawled.
Much web content is generated on-the-fly from databases,
which can be costly for the content provider, so excessive numbers of visits to a site are unwelcome.
Some content has a short lifespan.
Some regions and content providers have low bandwidth.
Requirements for the crawler's list L of URLs:
Every page is visited eventually
Synonym URLs are disregarded
Significant or dynamic pages are visited sufficiently frequently
The crawler isn't cycling indefinitely in a single web site (caught in a crawler trap)
Parsing
(translate data into a canonical form)
The aim of parsing is to reduce a web page, or a query, to a sequence of tokens
Canonicalisation
(having a single representation of a text)
Tokenisation
(decomposing the larger document into smaller information-bearing units)
Normalisation
(a form which is generally representative of other instances of the same idea)
Tokenisation steps
Folding case
convert to lowercase
Stripping punctuation
remove non-alphabetic characters
Stemming
remove affixes
Splitting into tokens
based on whitespace
Zoning
segment documents into discrete zones, e.g. title, anchor text, headings
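A minimal sketch of the steps above (the trailing-'s' removal is only a toy stand-in for a real stemmer such as Porter's, and zoning is omitted):

import re

def tokenise(text):
    """Case-fold, strip punctuation, split on whitespace, crudely stem."""
    text = text.lower()                      # folding case
    text = re.sub(r"[^a-z\s]", " ", text)    # stripping punctuation / non-alphabetic characters
    tokens = text.split()                    # splitting into tokens on whitespace
    return [t.rstrip("s") if len(t) > 3 else t for t in tokens]   # toy "stemming"

# e.g. tokenise("The cats sat, purring!") -> ['the', 'cat', 'sat', 'purring']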
Indexing
(Build efficient data structure)
inverted index:
a collection of lists, one per term, recording the identifiers of the documents containing that term.
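A minimal sketch of building such an index, assuming the documents are already tokenised strings and the document identifiers are simply list positions:

from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(documents):
        for term in doc.split():             # documents assumed already tokenised
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# e.g. build_inverted_index(["the old night keeper", "the night keeper keeps"])
#      -> {'the': [0, 1], 'old': [0], 'night': [0, 1], 'keeper': [0, 1], 'keeps': [1]}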
Querying
(process data in response to queries)
Boolean Querying
fetch each term
use intersection for AND
use union for OR
take complement for NOT
ignore within-document frequencies
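A minimal sketch of these operators over an inverted index like the one above; Python set operations stand in for merging the sorted lists:

def boolean_and(index, terms):
    """Documents containing every term: intersect the inverted lists."""
    lists = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []

def boolean_or(index, terms):
    """Documents containing any of the terms: union of the inverted lists."""
    return sorted(set().union(*(index.get(t, []) for t in terms)))

def boolean_not(index, terms, all_doc_ids):
    """Documents containing none of the terms: complement of the union."""
    return sorted(set(all_doc_ids) - set(boolean_or(index, terms)))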
Ranked Querying
accumulator
(the partial sum of the dot product)
for each query term t: compute w_q,t and fetch the inverted list for t
for each document d in that list: compute w_d,t and set A_d = A_d + w_q,t * w_d,t
normalise each accumulator: A_d = A_d / W_d (the document weight)
identify the largest A_d values and return the corresponding documents (see the sketch below)
accumulator cost
limiting:
processing the most important terms first, namely, the ones with highest IDF
thresholding:
Only create an accumulator when the TF-IDF value is above some threshold
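A minimal sketch of the accumulator loop (limiting and thresholding omitted); the postings are assumed to store within-document frequencies f_d,t, and doc_weights holds pre-computed W_d values such as vector lengths:

import math
from collections import defaultdict

def ranked_query(query_terms, index, doc_weights, num_docs, k=10):
    """TF-IDF ranked retrieval with one accumulator per candidate document."""
    accumulators = defaultdict(float)
    for t in query_terms:
        postings = index.get(t, {})                  # {doc_id: f_dt}
        if not postings:
            continue
        w_qt = math.log(num_docs / len(postings))    # query-term weight (IDF)
        for doc_id, f_dt in postings.items():
            w_dt = f_dt * w_qt                       # document-term weight (TF-IDF)
            accumulators[doc_id] += w_qt * w_dt
    # normalise by document weight, then return the top-k documents
    scores = {d: a / doc_weights[d] for d, a in accumulators.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]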
Add-ons
Phrase Queries
returning only documents where the query terms appear in sequence
use a "positional index": an index where the positions of terms (indices within the document)
Link Analysis
a weighting (or re-ranking) procedure
based on the link structure of the Web
assumption: popular pages are more likely to be relevant
PageRank
we begin at all pages with equal probability
traversal
follow one of the outgoing links from this page, chosen with equal probability
teleport
jump to a random page, as if entering its URL directly, with equal probability
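A minimal sketch of PageRank computed by power iteration over this random-surfer model; the damping factor 0.85 and the iteration count are assumptions, and every linked page is assumed to appear as a key of the link map:

def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns the stationary visit probabilities."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}                    # begin at all pages with equal probability
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}   # teleport component
        for p in pages:
            outgoing = links[p] or pages                  # a sink page teleports everywhere
            share = damping * rank[p] / len(outgoing)
            for q in outgoing:                            # traversal: follow an outgoing link
                new_rank[q] += share
        rank = new_rank
    return rank

# e.g. pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]})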
Machine Learning
Naive Bayes
assumption: each attribute is independent of all of the other attributes, given the class under consideration
parameters estimated by maximum likelihood (hedged with smoothing, e.g. add-one/Laplace, to avoid zero probabilities)
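A minimal sketch of a categorical Naive Bayes learner with add-one (Laplace) smoothing as the hedge; all function names are illustrative, and the attribute arity is crudely estimated from the observed values:

import math
from collections import Counter, defaultdict

def train_nb(instances, labels):
    """instances: list of attribute-value tuples. Returns smoothed count tables."""
    class_counts = Counter(labels)
    attr_counts = defaultdict(Counter)        # (attribute index, class) -> value counts
    for x, y in zip(instances, labels):
        for i, v in enumerate(x):
            attr_counts[(i, y)][v] += 1
    return class_counts, attr_counts

def predict_nb(x, class_counts, attr_counts, alpha=1.0):
    """Pick the class maximising log P(c) + sum_i log P(x_i | c), with add-alpha smoothing."""
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for i, v in enumerate(x):
            counts = attr_counts[(i, c)]
            num_values = len(counts) + 1       # crude estimate of the attribute's arity
            score += math.log((counts[v] + alpha) / (n_c + alpha * num_values))
        if score > best_score:
            best, best_score = c, score
    return best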
Decision Tree
GINI Index
1 - sum_i p_i^2
best: 0; worst: 0.5 (two classes)
Entropy
-sum_i p_i log2(p_i)
best: 0; worst: 1 (two classes)
Information Gain: Entropy(parent) - weighted sum of the children's entropies
Error
1 - max_i p_i
best: 0; worst: 0.5 (two classes)
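A minimal sketch computing the impurity measures above from a class-probability distribution, plus information gain for a candidate split (function names are illustrative):

import math

def gini(probs):
    return 1.0 - sum(p ** 2 for p in probs)              # 0 for pure, 0.5 for 50/50 (two classes)

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p)    # 0 for pure, 1 for 50/50 (two classes)

def error(probs):
    return 1.0 - max(probs)                               # 0 for pure, 0.5 for 50/50 (two classes)

def information_gain(parent, children):
    """parent: list of labels; children: list of label lists after the split."""
    n = sum(len(c) for c in children)
    def dist(labels):
        return [labels.count(v) / len(labels) for v in set(labels)]
    return entropy(dist(parent)) - sum(len(c) / n * entropy(dist(c)) for c in children)

# e.g. information_gain(["y", "y", "n", "n"], [["y", "y"], ["n", "n"]]) -> 1.0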
Pros
inexpensive to construct
extremely fast at classifying
easy to interpret
good accuracy
Cons
restrictive (only axis-parallel splits)
prone to becoming too specific and complex (overfitting)
Random Forests
Bagging
build a number of Decision Trees by re-sampling the data
randomly select (with repetition) N instances
build the tree as usual
classify the test instance by voting
for each node in the tree, we randomly select some subset of the possible attributes
Pros
deals with outlier instances in the original dataset
overcomes the problem of irrelevant attributes
many trees can be built in a reasonable amount of time
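A minimal sketch of bagging with scikit-learn decision trees (an assumption, not subject code); max_features='sqrt' approximates the per-node random attribute subset:

import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, num_trees=10):
    """Train each tree on a bootstrap sample (N instances drawn with repetition)."""
    trees = []
    n = len(X)
    for _ in range(num_trees):
        idx = [random.randrange(n) for _ in range(n)]          # resample with repetition
        tree = DecisionTreeClassifier(max_features="sqrt")     # random attribute subset per split
        tree.fit([X[i] for i in idx], [y[i] for i in idx])
        trees.append(tree)
    return trees

def vote(trees, x):
    """Classify a test instance by majority vote over the trees."""
    return Counter(tree.predict([x])[0] for tree in trees).most_common(1)[0][0]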
K-Nearest Neighbour
Euclidean distance (p, q)
Minkowski distance (p,q,k)
Manhattan distance
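A minimal sketch of the three distances for feature vectors p and q; Manhattan and Euclidean are the k=1 and k=2 cases of Minkowski:

def minkowski(p, q, k):
    return sum(abs(a - b) ** k for a, b in zip(p, q)) ** (1.0 / k)

def manhattan(p, q):
    return minkowski(p, q, 1)     # k = 1

def euclidean(p, q):
    return minkowski(p, q, 2)     # k = 2

# e.g. euclidean([0, 0], [3, 4]) -> 5.0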
Support Vector Machine
the best line (hyperplane) that divides the two classes
linear separability
maximum margin
soft margin
allow some misclassified points if this yields a larger margin
Kernel functions
map the data into a higher-dimensional space where it becomes linearly separable
"multi-class" classifier
build multiple models
one versus many
one versus one
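A minimal sketch using scikit-learn (an assumption, not subject code): a soft-margin SVM with an RBF kernel, where multi-class handling is done internally one-versus-one:

from sklearn.svm import SVC

# C controls the soft margin (a smaller C tolerates more misclassified points);
# the RBF kernel maps instances into a space where they may become linearly separable.
model = SVC(kernel="rbf", C=1.0, decision_function_shape="ovo")
model.fit([[0, 0], [1, 1], [2, 2], [3, 3]], ["a", "a", "b", "b"])
print(model.predict([[0.2, 0.1]]))   # expected: ['a']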
Evaluation
Accuracy
(TP+TN) / (TP+TN+FP+FN)
Precision
TP / (TP+FP)
Recall (Sensitivity)
TP / (TP+FN)
Specificity
TN / (TN+FP)
F1 Score
2 * Precision * Recall / (Precision + Recall)
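A minimal sketch computing all five measures from the binary confusion-matrix counts:

def metrics(tp, tn, fp, fn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    f1          = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# e.g. metrics(tp=40, tn=45, fp=5, fn=10) -> (0.85, 0.888..., 0.8, 0.9, 0.842...)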
Data Mining
main modes
classification
e.g. choose best label
clustering
e.g. find group of something
regression
e.g. predict sea level
association rule mining
e.g. find purchase predictive of other purchase
sequential discovery
e.g. find weather patterns in a given city
outlier detection
e.g. find erroneous values in a collection of data
Clustering
unsupervised machine learning (absence of labelled instances)
partitional (flat) & hierarchical (nested)
fuzzy & non-fuzzy, partial & complete, heterogeneous & homogeneous, etc.
cluster types: well-separated, center-based, contiguous, density-based
K-means
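A minimal sketch of Lloyd's algorithm for K-means; the random initialisation and fixed iteration count are assumptions:

import random

def kmeans(points, k, iterations=20):
    """points: list of numeric vectors. Returns final centroids and cluster memberships."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return centroids, clusters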
Association Rules
basic elements
Support count (σ)
Support
Confidence (c)
thresholds: minsup, minconf
Frequent Itemset
Generation Strategies
Reduce the number of candidates (M)
Complete search: M=2^d
Use pruning techniques
Reduce the number of transactions (N)
Used by DHP (Direct Hashing and Pruning)
and vertical-based mining algorithms
Reduce the number of comparisons (NM)
Use efficient data structures to store candidates/transactions
No need to match every candidate against every transaction
Apriori Algorithm
Generate frequent itemsets of length 1 (k=1)
Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
Prune candidates that contain an infrequent length-k subset
Count the support of each remaining candidate by scanning the DB
Eliminate candidates that are infrequent, then repeat with k+1
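A minimal sketch of these steps (support is counted by scanning the transaction DB; candidates are generated by joining frequent k-itemsets and pruned via their subsets); function and variable names are illustrative:

from itertools import combinations

def apriori(transactions, minsup):
    """transactions: list of sets of items. Returns all frequent itemsets with support counts."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)    # scan the DB

    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]): support(frozenset([i])) for i in items}
    frequent = {s: c for s, c in frequent.items() if c >= minsup}    # length-1 frequent itemsets
    all_frequent, k = dict(frequent), 1
    while frequent:
        # join step: build length-(k+1) candidates from length-k frequent itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # prune step: drop candidates with an infrequent length-k subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # count support and eliminate infrequent candidates
        frequent = {c: support(c) for c in candidates}
        frequent = {c: n for c, n in frequent.items() if n >= minsup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# e.g. apriori([{"bread", "milk"}, {"bread", "beer"}, {"milk", "beer"}, {"bread", "milk"}], minsup=2)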