Knowledge Technology
COMP90049 Knowledge Technologies University of Melbourne IT 2018 S1
String Search
Exact String Search
Pattern Matching: RegEx
Naive Matching
Boyer-Moore matching
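A minimal sketch of naive (brute-force) matching; the function name and example are illustrative, not from the subject. Boyer-Moore finds the same occurrences but skips ahead using its bad-character/good-suffix heuristics instead of trying every alignment.

def naive_search(text, pattern):
    """Return all start positions where pattern occurs exactly in text."""
    positions = []
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):           # try every alignment of pattern against text
        if text[i:i + m] == pattern:     # compare character by character
            positions.append(i)
    return positions

# e.g. naive_search("abracadabra", "abra") -> [0, 7]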
Approximate String Search
Neighbourhood Search
Edit Distance (see the sketch after this list)
Global
Local
N-Gram Distance
Phonetics (Soundex)
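A minimal sketch of global (Levenshtein) edit distance via dynamic programming, assuming unit costs for insertion, deletion and replacement; local distance and N-gram distance reuse the same tabular idea with different scoring.

def edit_distance(s, t):
    """Global (Levenshtein) edit distance with unit costs."""
    m, n = len(s), len(t)
    # dp[i][j] = cost of transforming s[:i] into t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                # i deletions
    for j in range(n + 1):
        dp[0][j] = j                                # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + sub)  # match / replace
    return dp[m][n]

# e.g. edit_distance("crat", "cart") -> 2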
Evaluation
Accuracy: fraction of correct responses
Precision: fraction of correct responses among attempted responses
Recall: proportion of query words for which a correct response appears somewhere among the returned candidates
Information retrieval
Information Seeking
Information needs
Informational: global warming
Factoid: melting point of lead
Topic tracking: Trump administration
Navigational: university of Melbourne
Transactional: Macbook Air
Geospatial: carlton restaurants
Answers
relevant
Document Matching
Boolean Query
documents match if they contain all of the query terms and none of the NOT terms
Pros
repeatable, auditable and controllable.
allow expression of complex concepts.
reduction in time spent reading
Cons
no ranking and no control over result set size
difficult to incorporate useful heuristics
Ranked Retrieval
Evidence of similarity
query terms in the title
query terms as phrases
recently created
translate between languages
authoritative, reliable documents
TF-IDF Model
More weight is given to a document where a query term appears many times.
(Term frequency or TF.)
Less weight is given to a term that appears in many documents.
(Inverse document frequency or IDF.)
Less weight is given to a document that has many terms.
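One common instantiation of these three factors (an assumption; the model admits several variants), in LaTeX notation:

w_{d,t} = \frac{f_{d,t}}{|d|} \cdot \log\frac{N}{f_t}

where f_{d,t} is the frequency of term t in document d, |d| the number of terms in d, N the number of documents in the collection, and f_t the number of documents containing t.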
Evaluation
Precision: number of returned relevant results/number of returned results
Recall: number of returned relevant results/total number of relevant results
(often useless in an IR context)
Precision at k (P@k):
(number of returned relevant results in the top k) / k
(Recall at k usually not meaningful)
Web Search
Crawling
(gather data from web)
Basic challenge: there is no central index of URLs of interest.
Secondary challenges:
Some websites return the same content as a new URL at each visit.
Some pages never return status 'done' on access.
Some websites are not intended to be crawled.
Much web content is generated on-the-fly from databases,
which can be costly for the content provider, so excessive numbers of visits to a site are unwelcome.
Some content has a short lifespan.
Some regions and content providers have low bandwidth.
Requirements for the crawler's list L of URLs:
Every page is visited eventually
Synonym URLs are disregarded
Significant or dynamic pages are visited sufficiently frequently
The crawler isn't cycling indefinitely in a single web site (caught in a crawler trap)
Parsing
(translate data into a canonical form)
The aim of parsing is to reduce a web page, or a query, to a sequence of tokens
Canonicalisation
(having a single representation of a text)
Tokenisation
(decomposing the larger document into smaller information-bearing units)
Normalisation
(a form which is generally representative of other instances of the same idea)
Tokenisation steps
Folding case
convert to lowercase
Stripping punctuation
remove non-alphabetic characters
Stemming
remove affixes
Splitting into tokens
based on whitespace
Zoning
segment documents into discrete zones, e.g. title, anchor text, headings
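A minimal sketch of the steps above (the trailing-'s' removal is only a toy stand-in for a real stemmer such as Porter's, and zoning is omitted):

import re

def tokenise(text):
    """Case-fold, strip punctuation, split on whitespace, crudely stem."""
    text = text.lower()                      # folding case
    text = re.sub(r"[^a-z\s]", " ", text)    # stripping punctuation / non-alphabetic characters
    tokens = text.split()                    # splitting into tokens on whitespace
    return [t.rstrip("s") if len(t) > 3 else t for t in tokens]   # toy "stemming"

# e.g. tokenise("The cats sat, purring!") -> ['the', 'cat', 'sat', 'purring']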
Indexing
(Build efficient data structure)
inverted index:
a collection of lists, one per term, recording the identifiers of the documents containing that term.
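A minimal sketch of building such an index, assuming the documents are already tokenised strings and the document identifiers are simply list positions:

from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(documents):
        for term in doc.split():             # documents assumed already tokenised
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# e.g. build_inverted_index(["the old night keeper", "the night keeper keeps"])
#      -> {'the': [0, 1], 'old': [0], 'night': [0, 1], 'keeper': [0, 1], 'keeps': [1]}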
Querying
(process data in response to queries)
Boolean Querying
fetch each term
use intersection for AND
use union for OR
take complement for NOT
ignore within-document frequencies
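A minimal sketch of these operators over an inverted index like the one above; Python set operations stand in for merging the sorted lists:

def boolean_and(index, terms):
    """Documents containing every term: intersect the inverted lists."""
    lists = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []

def boolean_or(index, terms):
    """Documents containing any of the terms: union of the inverted lists."""
    return sorted(set().union(*(index.get(t, []) for t in terms)))

def boolean_not(index, terms, all_doc_ids):
    """Documents containing none of the terms: complement of the union."""
    return sorted(set(all_doc_ids) - set(boolean_or(index, terms)))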
Ranked Querying
accumulator
(the partial sum of the dot product)
for each query term t: compute w_q,t and fetch the inverted list for t
for each document d in that list: compute w_d,t and set A_d = A_d + w_q,t * w_d,t
normalise each accumulator: A_d = A_d / W_d (the document weight)
identify the largest A_d values and return the corresponding documents (see the sketch below)
accumulator cost
limiting:
processing the most important terms first, namely, the ones with highest IDF
thresholding:
Only create an accumulator when the TF-IDF value is above some threshold
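A minimal sketch of the accumulator loop (limiting and thresholding omitted); the postings are assumed to store within-document frequencies f_d,t, and doc_weights holds pre-computed W_d values such as vector lengths:

import math
from collections import defaultdict

def ranked_query(query_terms, index, doc_weights, num_docs, k=10):
    """TF-IDF ranked retrieval with one accumulator per candidate document."""
    accumulators = defaultdict(float)
    for t in query_terms:
        postings = index.get(t, {})                  # {doc_id: f_dt}
        if not postings:
            continue
        w_qt = math.log(num_docs / len(postings))    # query-term weight (IDF)
        for doc_id, f_dt in postings.items():
            w_dt = f_dt * w_qt                       # document-term weight (TF-IDF)
            accumulators[doc_id] += w_qt * w_dt
    # normalise by document weight, then return the top-k documents
    scores = {d: a / doc_weights[d] for d, a in accumulators.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]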
Add-ons
Phrase Queries
returning only documents where the query terms appear in sequence
use a "positional index": an index where the positions of terms (indices within the document)
Link Analysis
a weighting (or re-ranking) procedure
based on the link structure of the Web
assumption: popular pages are more likely to be relevant
PageRank
we begin at all pages with equal probability
traversal
follow one of the outgoing links from this page, chosen with equal probability
teleport
jump to a random page, as if entering its URL directly, with equal probability
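A minimal sketch of PageRank computed by power iteration over this random-surfer model; the damping factor 0.85 and the iteration count are assumptions, and every linked page is assumed to appear as a key of the link map:

def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns the stationary visit probabilities."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}                    # begin at all pages with equal probability
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}   # teleport component
        for p in pages:
            outgoing = links[p] or pages                  # a sink page teleports everywhere
            share = damping * rank[p] / len(outgoing)
            for q in outgoing:                            # traversal: follow an outgoing link
                new_rank[q] += share
        rank = new_rank
    return rank

# e.g. pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]})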
Machine Learning
Naive Bayes
assumption: each attribute is independent of all of the other attributes, given the class under consideration
parameters estimated by maximum likelihood (hedged with smoothing, e.g. add-one/Laplace, to avoid zero probabilities)
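A minimal sketch of a categorical Naive Bayes learner with add-one (Laplace) smoothing as the hedge; all function names are illustrative, and the attribute arity is crudely estimated from the observed values:

import math
from collections import Counter, defaultdict

def train_nb(instances, labels):
    """instances: list of attribute-value tuples. Returns smoothed count tables."""
    class_counts = Counter(labels)
    attr_counts = defaultdict(Counter)        # (attribute index, class) -> value counts
    for x, y in zip(instances, labels):
        for i, v in enumerate(x):
            attr_counts[(i, y)][v] += 1
    return class_counts, attr_counts

def predict_nb(x, class_counts, attr_counts, alpha=1.0):
    """Pick the class maximising log P(c) + sum_i log P(x_i | c), with add-alpha smoothing."""
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for i, v in enumerate(x):
            counts = attr_counts[(i, c)]
            num_values = len(counts) + 1       # crude estimate of the attribute's arity
            score += math.log((counts[v] + alpha) / (n_c + alpha * num_values))
        if score > best_score:
            best, best_score = c, score
    return best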
Decision Tree
GINI Index
1 - sum_i p_i^2
best: 0; worst: 0.5 (two classes)
Entropy
-sum_i p_i log2(p_i)
best: 0; worst: 1 (two classes)
Information Gain: Entropy(parent) - weighted sum of the children's entropies
Error
1 - max_i p_i
best: 0; worst: 0.5 (two classes)
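A minimal sketch computing the impurity measures above from a class-probability distribution, plus information gain for a candidate split (function names are illustrative):

import math

def gini(probs):
    return 1.0 - sum(p ** 2 for p in probs)              # 0 for pure, 0.5 for 50/50 (two classes)

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p)    # 0 for pure, 1 for 50/50 (two classes)

def error(probs):
    return 1.0 - max(probs)                               # 0 for pure, 0.5 for 50/50 (two classes)

def information_gain(parent, children):
    """parent: list of labels; children: list of label lists after the split."""
    n = sum(len(c) for c in children)
    def dist(labels):
        return [labels.count(v) / len(labels) for v in set(labels)]
    return entropy(dist(parent)) - sum(len(c) / n * entropy(dist(c)) for c in children)

# e.g. information_gain(["y", "y", "n", "n"], [["y", "y"], ["n", "n"]]) -> 1.0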
Pros
inexpensive to construct
extremely fast at classifying
easy to interpret
good accuracy
Cons
restrictive (only axis-parallel splits)
prone to becoming too specific and complex (overfitting)
Random Forests
Bagging
build a number of Decision Trees by re-sampling the data
randomly select (with repetition) N instances
build the tree as usual
classify the test instance by voting
for each node in the tree, we randomly select some subset of the possible attributes
Pros
deals with outlier instances in the original dataset
overcomes the problem of irrelevant attributes
many trees can be built in a reasonable amount of time
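A minimal sketch of bagging with scikit-learn decision trees (an assumption, not subject code); max_features='sqrt' approximates the per-node random attribute subset:

import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, num_trees=10):
    """Train each tree on a bootstrap sample (N instances drawn with repetition)."""
    trees = []
    n = len(X)
    for _ in range(num_trees):
        idx = [random.randrange(n) for _ in range(n)]          # resample with repetition
        tree = DecisionTreeClassifier(max_features="sqrt")     # random attribute subset per split
        tree.fit([X[i] for i in idx], [y[i] for i in idx])
        trees.append(tree)
    return trees

def vote(trees, x):
    """Classify a test instance by majority vote over the trees."""
    return Counter(tree.predict([x])[0] for tree in trees).most_common(1)[0][0]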
K-Nearest Neighbour
Euclidean distance (p, q)
Minkowski distance (p,q,k)
Manhattan distance
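A minimal sketch of the three distances for feature vectors p and q; Manhattan and Euclidean are the k=1 and k=2 cases of Minkowski:

def minkowski(p, q, k):
    return sum(abs(a - b) ** k for a, b in zip(p, q)) ** (1.0 / k)

def manhattan(p, q):
    return minkowski(p, q, 1)     # k = 1

def euclidean(p, q):
    return minkowski(p, q, 2)     # k = 2

# e.g. euclidean([0, 0], [3, 4]) -> 5.0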
Support Vector Machine
the best line (hyperplane) that divides the two classes
linear separability
maximum margin
soft margin
allow some misclassified points if this yields a larger margin
Kernel functions
map the data into a higher-dimensional space where it becomes linearly separable
"multi-class" classifier
build multiple models
one versus many
one versus one
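A minimal sketch using scikit-learn (an assumption, not subject code): a soft-margin SVM with an RBF kernel, where multi-class handling is done internally one-versus-one:

from sklearn.svm import SVC

# C controls the soft margin (a smaller C tolerates more misclassified points);
# the RBF kernel maps instances into a space where they may become linearly separable.
model = SVC(kernel="rbf", C=1.0, decision_function_shape="ovo")
model.fit([[0, 0], [1, 1], [2, 2], [3, 3]], ["a", "a", "b", "b"])
print(model.predict([[0.2, 0.1]]))   # expected: ['a']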
Evaluation
Accuracy
(TP+TN) / (TP+TN+FP+FN)
Precision
TP / (TP+FP)
Recall (Sensitivity)
TP / (TP+FN)
Specificity
TN / (TN+FP)
F1 Score
2 * Precision * Recall / (Precision + Recall)
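A minimal sketch computing all five measures from the binary confusion-matrix counts:

def metrics(tp, tn, fp, fn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    f1          = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# e.g. metrics(tp=40, tn=45, fp=5, fn=10) -> (0.85, 0.888..., 0.8, 0.9, 0.842...)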
Data Mining
main modes
classification
e.g. choose best label
clustering
e.g. find group of something
regression
e.g. predict sea level
association rule mining
e.g. find purchase predictive of other purchase
sequential discovery
e.g. find weather patterns in a given city
outlier detection
e.g. find erroneous values in a collection of data
Clustering
unsupervised machine learning (absence of labelled instances)
partitional (flat) & hierarchical (nested)
fuzzy & non-fuzzy, partial & complete, heterogeneous & homogeneous, etc.
cluster types: well-separated, center-based, contiguous, density-based
K-means
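A minimal sketch of Lloyd's algorithm for K-means; the random initialisation and fixed iteration count are assumptions:

import random

def kmeans(points, k, iterations=20):
    """points: list of numeric vectors. Returns final centroids and cluster memberships."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return centroids, clusters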
Association Rules
basic elements
Support count (σ)
Support
Confidence (c)
thresholds: minsup, minconf
Frequent Itemset
Generation Strategies
Reduce the number of candidates (M)
Complete search: M=2^d
Use pruning techniques
Reduce the number of transactions (N)
Used by DHP (Direct Hashing and Pruning)
and vertical-based mining algorithms
Reduce the number of comparisons (NM)
Use efficient data structures to store candidates/transactions
No need to match every candidate against every transaction
Apriori Algorithm
Generate frequent itemsets of length 1 (k=1)
Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
Prune candidates that contain an infrequent length-k subset
Count the support of each remaining candidate by scanning the DB
Eliminate candidates that are infrequent, then repeat with k+1
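A minimal sketch of these steps (support is counted by scanning the transaction DB; candidates are generated by joining frequent k-itemsets and pruned via their subsets); function and variable names are illustrative:

from itertools import combinations

def apriori(transactions, minsup):
    """transactions: list of sets of items. Returns all frequent itemsets with support counts."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)    # scan the DB

    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]): support(frozenset([i])) for i in items}
    frequent = {s: c for s, c in frequent.items() if c >= minsup}    # length-1 frequent itemsets
    all_frequent, k = dict(frequent), 1
    while frequent:
        # join step: build length-(k+1) candidates from length-k frequent itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # prune step: drop candidates with an infrequent length-k subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # count support and eliminate infrequent candidates
        frequent = {c: support(c) for c in candidates}
        frequent = {c: n for c, n in frequent.items() if n >= minsup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# e.g. apriori([{"bread", "milk"}, {"bread", "beer"}, {"milk", "beer"}, {"bread", "milk"}], minsup=2)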