Dictionaries and tolerant retrival

Hashtables

Simplest Binary tree
More usual : B+ tree
Trees require a standard ordering of characters and hence strings … but, we typically have one
Pros : Solves the prefix problem (ex. terms starting with hany)
Optimized for disk-based retrieval
Cons : Slower => O(logM) ( this requires balanced tree ) / Rebalancing binary trees is expensive

Indexing mechanisms used to speed up access to desired data.
Search Key : attribute to set of attributes used to look up records in a file
Index file : consists of records ( called index entries ) of them
TWO basic kind of indices :
1. Ordered Indices : search keys are stored in sorted order
2. Hash indices : search keys are distributed uniformly across “bucket” using a “hash function”

In an Ordered Index, Index entries are stored sorted on the search key value
Primary Index (Clustering index): in sequentially ordered fiel, this index whose search key specifies the sequential order of the file.
Secondary Index (“non-clustering index”): ans index whose search by key specfiies an order diffent from the sequential of the file.

Dens-Index - Index record appears for every search-key value in the file.

screenshot

screenshot

contatins index records for only some search-key values
To locate a record with search key value k we

Find index record with search-key value K we

Search file sequentially starting at the reocord to which the index record points

Indexed record points to a buket that contains pointer

to all the actual records with that particular search-key value
Secondary indics hav to be dense

screenshot

Indices offer substantial benefits when searching for records
BUT: Updating indices imposes overhead on database modification when a file is modifided, every index on the file must be updated.
Sequential scan using primary index is efficient, but a sequential scan using a secondary index is expensive

If primary index does not fit in memory, access becomes expensive
Solution: treat primary index kept on disk as a sequential file and construct a sparse index on it.
outer index – a sparse index of primary index
inner index – the primary index file
If even outer index is too large to fit in main memory, yet another level of index can be created, and so on
Indices at all levels must be updated on insertion or deletion from the file.

screenshot

All paths from root to leaf are of the same length (balanced)

==> Complete Binary Rebalanceing Tree
Each node that is not a root or a leaf has between [n/2] and n children
A leaf node has between [(n−1)/2] and n–1 values
Special cases:

1.If the root is not a leaf, it has at least 2 children.

2.If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values.

Disk-base data structure (not main memory)

- Page

Fanout n of a node: the number of pointers out of the node