Wild - Card Queries

- “ * “ / “ $ “

  • ” * “ means 0 or more occurrences
  • ” $ “ represents the end ( m$ : m 으로 끝남 , $m : m 으로 시작함 )

Example

mon* : find all docs containing any word beginning with "mon"

Easy with Binary tree ( or B+ Tree ) lexicon ( 사전 ) : retrieve all words in range : mon <= w < moo

*mon : find words ending in "mon" : harder

Maintain an additional B+ Tree for terms backwards

- Query Processing

  • At this point, we have an enumeration of all terms in the Dictionary that match the wild-card query
  • We still have to look up the postings for each enumerated term

Example

se * ate
se *  AND * ate : This may result in the execution of many Boolean AND queries.

–> Expensive ( B+ Tree 를 하나 더 사용하게 된다 )

Solution : Transform wild-card queries so that the “ * “ occur at the end

This give rise to the Permuterm Index

- Permuterm Index

From term hello, index under :
- hello$ , ello$h, llo$he, lo$hel, o$hell, $hello 
( 한칸씩 Shifting 하면서 해당 단어에 대한 Permuterm  전부 만든다 )

해당 Word 에 대해 모든 Permuterm 을 Dictinary 에 넣고 Search 시 사용한다.

screenshot

Search H * llo
- hello -> llo$he, ...
- hillo -> llo$hi, ...
- haaallo -> llo$haaa, ...
--> 미리 Dictionary  정의된 단어들의 Permuterm   Search word  match 되는 것을 찾는 방식

Exercise

1. Write down the entries in the permuterm index dictionary that are generated by the term mama.
- ANS :  mama$, ama$m, ma$ma, a$mam, $mama
2. If you wanted to search for s*ng in a permuterm wildcard index, what key(s) would one do the lookup on?
- ANS : ng$s *
3.
 - Consider again the query fi * mo * er.
 - What Boolean query on a bigram index would be generated for this query?
 	- ANS : The Boolean query is $f AND fi AND mo AND er AND r$.
 - Can you think of a term that matches the permuterm query, but does not satisfy this Boolean query?
 	- ANS : The "filibuster" term will match the permuterm query, but does not satisfy this Boolean query

- Bigram ( k - gram ) Indexes

  • Enumerate all k - grams ( sequence of k chars ) occuring in any term

screenshot

  • $ is a special word boundary symbol
  • Maintain a second inverted index from bigrams to dictionary terms that match each bigram

screenshot

Example

he * llo
- he -> hello, he, she, loche, ...
- lo -> hello, location, log, loche, ...
Check match he * lo ---> hello, loche, ...

- Processing wild - cards

  • Query mon * can now be run as
    $m AND mo AND on

  • Gets terms that match AND version of our wildcard query
  • But we may enumerate moon too!
    • Must post-filter these terms against query.
    • Or positional index can handle this problem
  • Surviving enumerated terms are then looked up in the term-document inverted index
  • Fast, space efficient ( Compared to permuterm )
  • As before, we must execute a Boolean query for each enumerated, filtered term
  • Wild-cards can result in expensive query execution (very large disjunctions…)