Boolean Model

- Strength

  • Rich expressions for queries
  • Clear logical interpretation

- Problems

  • Relevance ( = score of a [ query, document ] pair ) is either 1 or 0
    so the result contains either many documents or few/no documents
    No term weighting is used in the document or the query
  • Difficult for end users to form a correct Boolean query
  • Problem with Boolean search
    • Boolean queries often result in either too few (=0) or too many (1000s) results
      Example)
      Query 1: “standard user iptime N4” → 200,000 hits
      Query 2: “standard user iptime N4 no channel found” → 0 hits
    • It takes a lot of skill to come up with a query that produces a manageable number of hits
      ( AND gives too few; OR gives too many )
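
The feast-or-famine effect can be reproduced on a tiny collection. The sketch below is only an illustration: the four documents and the helper names `boolean_and` and `boolean_or` are made up here and are not from the lecture.

```python
# Toy collection (made up for illustration); each document is a set of terms.
docs = {
    1: {"standard", "user", "iptime", "n4"},
    2: {"iptime", "n4", "no", "channel", "found"},
    3: {"standard", "user", "guide"},
    4: {"channel", "list"},
}

def boolean_and(query_terms, docs):
    """Return ids of documents that contain every query term."""
    return sorted(d for d, terms in docs.items() if query_terms <= terms)

def boolean_or(query_terms, docs):
    """Return ids of documents that contain at least one query term."""
    return sorted(d for d, terms in docs.items() if query_terms & terms)

query = {"standard", "user", "iptime", "n4", "no", "channel", "found"}
and_hits = boolean_and(query, docs)  # no document holds all seven terms
or_hits = boolean_or(query, docs)    # every document holds at least one
print(and_hits, or_hits)
```

With the long query, AND matches nothing while OR matches the whole collection, and neither result is ordered by how well a document matches.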

–> Solution : Ranked Retrieval

Ranked retrieval

Using Free Text Queries

- Feast or famine: not a problem in ranked retrieval

- Query Document Matching Scores
( a document that contains many of the query terms gets a high score - Jaccard coefficient )

Jaccard coefficient

jaccard(A, B) = |A ∩ B| / |A ∪ B|

  • A and B don’t have to be the same size
  • Always assigns a number between 0 and 1

Example)

Query : ides of march

Doc1 : caesar died in march → jaccard : 1/6

Doc2 : the long march → jaccard : 1/5

→ Doc1 is closer in meaning, but Jaccard gives Doc2 the higher rank.

  • We need a more sophisticated way of normalizing for length
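
The Jaccard example can be checked with a few lines of Python. This is a minimal sketch: the function name `jaccard`, whitespace tokenization, and the value returned for two empty sets are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two term collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumed convention; the ratio is undefined for two empty sets
    return len(a & b) / len(a | b)

query = "ides of march".split()
doc1 = "caesar died in march".split()
doc2 = "the long march".split()

j1 = jaccard(query, doc1)  # 1 shared term out of 6 distinct terms
j2 = jaccard(query, doc2)  # 1 shared term out of 5 distinct terms
print(j1, j2)
```

Both documents share only "march" with the query, so the shorter Doc2 wins purely because its union with the query is smaller. Term frequency and term rarity play no role.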


Bag of words model

- Term Frequency

  • Does not consider the ordering of words
  • Term Frequency : tf
  • The term frequency tf(t,d) of term t in document d is defined as the number of times that t occurs in d.
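
A bag of words can be sketched with `collections.Counter`. The helper name `term_frequencies` is hypothetical, and lowercasing plus whitespace splitting are simplifying assumptions rather than part of the definition.

```python
from collections import Counter

def term_frequencies(text):
    """Map each term to its raw count tf(t, d); word order is discarded."""
    return Counter(text.lower().split())

tf_a = term_frequencies("Hanyang Ansan Univ Hanyang")
tf_b = term_frequencies("Ansan Hanyang Hanyang Univ")  # same words, new order

print(tf_a["hanyang"])  # count of "hanyang" in the first document
print(tf_a == tf_b)     # reordering the words does not change the bag
```

A `Counter` returns 0 for a term that never occurs, which matches tf(t,d) = 0 for absent terms.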

Log-frequency weighting

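w(t,d) = 1 + log tf(t,d)   if tf(t,d) > 0
w(t,d) = 0                 otherwise

score(q,d) = sum of w(t,d) over the terms t that occur in both q and d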

Example)

Doc1 : Hanyang Ansan Univ Hanyang

Query : Hanyang Ansan

→ score(query, doc1) = (1 + log(2)) : Hanyang occurs 2 times + (1 + log(1)) : Ansan occurs 1 time ≈ 1.30 + 1 = 2.30 ( log base 10 )
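
The log-frequency score for this example can be computed as below. This is a sketch that assumes base-10 logarithms and a weight of 0 for terms that do not occur; the names `log_tf_weight` and `overlap_score` are hypothetical.

```python
import math
from collections import Counter

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) when tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(query, doc):
    """Sum the log-frequency weights of the query terms found in the document."""
    counts = Counter(doc.split())
    return sum(log_tf_weight(counts[t]) for t in set(query.split()))

score = overlap_score("Hanyang Ansan", "Hanyang Ansan Univ Hanyang")
print(round(score, 2))  # Hanyang (tf = 2) plus Ansan (tf = 1)
```

The logarithm dampens repetition: doubling the count of Hanyang adds only about 0.3 to its weight instead of doubling it.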

- Document frequency

  • Rare terms are more informative than frequent terms

    → We want a high weight for rare terms like arachnocentric.

  • Document frequency : df

idf weight

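idf(t) = log( N / df(t) )
( N : number of documents in the collection, df(t) : number of documents that contain t )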

Effect of idf on ranking

  • idf has no effect on ranking one term queries
    ( with one term, idf multiplies every document's score by the same factor, so the order does not change )
  • idf affects the ranking of documents for queries with at least two terms
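
A short sketch of the idf weight and of why it cannot reorder a one-term query. The collection size, the document frequencies, and the tf weights below are made-up numbers, and base-10 logarithms are assumed.

```python
import math

def idf(n_docs, df):
    """Inverse document frequency log10(N / df) of a term."""
    return math.log10(n_docs / df)

N = 1_000_000             # hypothetical collection size
rare = idf(N, 1)          # term found in a single document: largest weight
common = idf(N, 1000)     # term found in 1000 documents: smaller weight
everywhere = idf(N, N)    # term found in every document: weight 0

# One-term query: each document's tf weight is multiplied by the same idf,
# so sorting by tf alone or by tf x idf gives the same order.
tf_weights = {"d1": 1.3, "d2": 1.0, "d3": 1.7}  # made-up tf weights
by_tf = sorted(tf_weights, key=tf_weights.get, reverse=True)
by_tf_idf = sorted(tf_weights, key=lambda d: tf_weights[d] * common, reverse=True)
print(by_tf == by_tf_idf)
```

With two or more query terms the factors differ per term, so a match on a rare term can outweigh several matches on a common one.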

tf-idf weighting

w(t,d) = ( 1 + log tf(t,d) ) x log( N / df(t) )

Score for a document given a query

score(q,d) = sum of the tf-idf weights w(t,d) over the terms t that occur in both q and d

Example)

Query : Hanyang Univ

Doc1 : Hanyang Ansan Univ Hanyang

Doc2 : Hanyang Ansan

( here N = 10, df(Hanyang) = 2, df(Univ) = 1, log base 10 )

→ score(query, doc1) = [ ( 1 + log(2) ) x log(10/2) ] : Hanyang + [ 1 x log(10/1) ] : Univ ≈ 0.91 + 1 = 1.91

→ score(query, doc2) = [ 1 x log(10/2) ] : Hanyang ≈ 0.70
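
The tf-idf score of this example can be sketched as follows. It assumes N = 10, df(Hanyang) = 2, and df(Univ) = 1 (the values implied by the log(10/2) and log(10/1) factors), base-10 logarithms, and one idf per term shared by all documents. The function name `tf_idf_score` is hypothetical.

```python
import math
from collections import Counter

N = 10                          # assumed collection size
df = {"Hanyang": 2, "Univ": 1}  # assumed document frequencies

def tf_idf_score(query, doc, df, n_docs):
    """Sum (1 + log10 tf) * log10(N / df) over query terms present in the doc."""
    counts = Counter(doc.split())
    score = 0.0
    for term in set(query.split()):
        tf = counts[term]
        if tf > 0 and term in df:
            score += (1 + math.log10(tf)) * math.log10(n_docs / df[term])
    return score

s1 = tf_idf_score("Hanyang Univ", "Hanyang Ansan Univ Hanyang", df, N)
s2 = tf_idf_score("Hanyang Univ", "Hanyang Ansan", df, N)
print(round(s1, 2), round(s2, 2))
```

Doc1 ranks first because it matches both query terms, and the rarer term Univ contributes the larger share of its score.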