
Analysis of the index data model from a database perspective

The logical representation of an index is an abstraction over its physical implementation (e.g. inverted indexes, suffix trees, suffix arrays or signature files). This abstraction resembles the data independence principle exploited by databases and, on closer inspection, databases and search engine indexes turn out to have similar data structures. In the relational model, a table is a collection of rows having a uniform structure and intended meaning; a table is composed of a set of columns, called attributes, whose values are taken from a set of domains (such as integers, strings or booleans). Likewise, in the index data model, an index is a collection of documents of a given (possibly generic) type having a uniform structure and intended meaning, where a document is composed of a set of fields (possibly a single one) whose values also belong to different domains (string, date, integer, etc.).
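To make the analogy concrete, here is a minimal Python sketch of the logical index model described above. All names (FieldDef, Index, the "articles" index and its sample document) are hypothetical and chosen for illustration; the sketch does not mirror the API of any particular search library, and it says nothing about the physical implementation underneath.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any, Dict, List


@dataclass(frozen=True)
class FieldDef:
    """A field declaration: the analogue of a relational attribute."""
    name: str
    domain: type  # the domain the field's values are taken from, e.g. str, int, date


@dataclass
class Index:
    """A collection of documents sharing a uniform structure: the analogue of a table."""
    name: str
    fields: List[FieldDef]
    documents: List[Dict[str, Any]] = field(default_factory=list)

    def add(self, doc: Dict[str, Any]) -> None:
        # Enforce the uniform structure: only declared fields,
        # and every value must belong to the declared domain.
        declared = {f.name: f.domain for f in self.fields}
        for field_name, value in doc.items():
            if field_name not in declared:
                raise ValueError(f"unknown field: {field_name}")
            if not isinstance(value, declared[field_name]):
                raise TypeError(
                    f"{field_name} expects {declared[field_name].__name__}"
                )
        self.documents.append(doc)


# Illustrative data only.
articles = Index(
    "articles",
    [FieldDef("title", str), FieldDef("published", date), FieldDef("pages", int)],
)
articles.add({"title": "Signature files", "published": date(2009, 5, 1), "pages": 12})
```

The check in add plays the role that a schema plays for a table: a document whose field value falls outside the declared domain is rejected.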

Unlike databases, though, search engine indexes have neither functional dependencies nor inclusion dependencies defined on their fields, except for an implied key dependency used to uniquely identify documents within an index. Moreover, it is not possible to define join dependencies between fields belonging to different indexes. Another difference that widens the gap between the database data model and the index data model is the lack of standard data definition and data manipulation languages. For example, neither the literature nor industry offers a standard query language for search engines (as SQL is for databases); this heterogeneity is mainly due to the strong dependence of the adopted query convention on the structure and nature of the items in the indexed collection.
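The implied key dependency can be sketched as follows. KeyedIndex and its methods are hypothetical names, and the choice that re-adding a document with an existing key replaces the old one (rather than being rejected) is an assumption made for the sketch, not something the model above prescribes.

```python
from typing import Any, Dict, Optional


class KeyedIndex:
    """Hypothetical index in which one designated field uniquely identifies a document."""

    def __init__(self, key_field: str) -> None:
        self.key_field = key_field
        self._docs: Dict[Any, Dict[str, Any]] = {}

    def add(self, doc: Dict[str, Any]) -> None:
        # Every document must carry the key field (raises KeyError otherwise).
        key = doc[self.key_field]
        # Assumption: a document with an existing key replaces the previous one,
        # so the key always determines at most one document.
        self._docs[key] = doc

    def get(self, key: Any) -> Optional[Dict[str, Any]]:
        return self._docs.get(key)

    def __len__(self) -> int:
        return len(self._docs)


pages = KeyedIndex(key_field="url")
pages.add({"url": "/a", "title": "First version"})
pages.add({"url": "/a", "title": "Second version"})  # same key: replaces the first
pages.add({"url": "/b", "title": "Another page"})
```

No other dependency is expressed here: nothing relates "title" to any other field, and nothing links this index to a second one.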

In its simplest form, for a collection of items with a textual representation, a query is composed of keywords and the retrieved items are those that contain these keywords. An extension of this simple querying mechanism applies to collections of structured text documents, where index fields allow users to search not only the whole document but also specific attributes. From a database model perspective, though, only the selection and projection operators are available: users can specify keyword-based queries over fields belonging to the document structure and, optionally, restrict the result to a subset of the document's fields.
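The two operators can be illustrated with a small in-memory inverted index. This is only a sketch under stated assumptions: FieldedIndex, search, field_name and show are hypothetical names, tokenization is a plain lowercase whitespace split, and multiple keywords are combined conjunctively (every keyword must match), which is one common convention rather than a standard.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple


class FieldedIndex:
    """Hypothetical inverted index keyed by (field name, term)."""

    def __init__(self) -> None:
        self._docs: Dict[int, Dict[str, str]] = {}
        self._fields: Set[str] = set()
        self._postings: Dict[Tuple[str, str], Set[int]] = defaultdict(set)

    def add(self, doc_id: int, doc: Dict[str, str]) -> None:
        self._docs[doc_id] = doc
        for field_name, text in doc.items():
            self._fields.add(field_name)
            for term in text.lower().split():
                self._postings[(field_name, term)].add(doc_id)

    def search(
        self,
        keywords: Iterable[str],
        field_name: Optional[str] = None,
        show: Optional[Iterable[str]] = None,
    ) -> List[Dict[str, str]]:
        # Selection: keep the documents that contain every keyword,
        # either in the given field or, if none is given, in any field.
        search_fields = self._fields if field_name is None else {field_name}
        matches = set(self._docs)
        for keyword in keywords:
            term = keyword.lower()
            hits: Set[int] = set()
            for f in search_fields:
                hits |= self._postings.get((f, term), set())
            matches &= hits
        # Projection: keep only the requested fields of each matching document.
        wanted = None if show is None else set(show)
        return [
            {k: v for k, v in self._docs[doc_id].items() if wanted is None or k in wanted}
            for doc_id in sorted(matches)
        ]


idx = FieldedIndex()
idx.add(1, {"title": "Suffix arrays", "body": "compact alternative to suffix trees"})
idx.add(2, {"title": "Signature files", "body": "hashing terms into bit signatures"})

titles_only = idx.search(["suffix"], field_name="title", show=["title"])
whole_document = idx.search(["signatures"])
```

Here titles_only restricts both the searched field (selection on "title") and the returned fields (projection onto "title"), while whole_document searches every field and returns the matching document in full. There is no way to combine this index with another one, which matches the absence of joins noted earlier.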
