# Posts Tagged ‘data model

### Integrating metadata into the data model

Mathematical models define infinite precision real numbers and functions with infinite domains, whereas computer data objects contain finite amounts of information and must therefore be approximations to the mathematical objects that they represent. Several forms of scientific metadata serve to specify how computer data objects approximate mathematical objects and these are integrated into our data model. For example, missing data codes (used for fallible sensor systems) may be viewed as approximations that carry no information. Any value or sub-object in a VIS-AD data object may be set to the missing value. Scientists often use arrays for finite samplings of continuous functions, as, for example, satellite image arrays are finite sampling of continuous radiance fields. Sampling metadata, such as those that assign Earth locations to pixels, and those that assign real radiances to coded (e.g., 8-bit) pixel values, quantify how arrays approximate functions and are integrated with VIS-AD array data objects.

The integration of metadata into our data model has practical consequences for the semantics of computation and display. For example, we define a data type goes_image as an array of ir radiances indexed by lat_lon values. Arrays of this data type are indexed by pairs of real numbers rather than by integers. If goes_west is a data object of type goes_image and loc is a data object of type lat_lon then the expression goes_west[loc] is evaluated by picking the sample of goes_west nearest to loc. If loc falls outside the region of the Earth covered by goes_west pixels then goes_west[loc] evaluates to the missing value. If goes_east is another data object of type goes_image, generated by a satellite with a different Earth perspective, then the expression goes_west – goes_east is evaluated by resampling goes_east to the samples of goes_west (i.e., by warping the goes_east image) before subtracting radiances. In Earth regions where the goes_west and goes_east images do not overlap, their difference is set to missing values. Thus metadata about map projections and missing data contribute to the semantics of computations.

Metadata similarly contribute to display semantics. If  both goes_east and goes_west are selected for display, the system uses the sampling of their indices to co-register these two images in a common Earth frame of reference. The samplings of 2-D and 3-D array indices need not be Cartesian. For example, the sampling of lat_lon may define virtually any map projection. Thus data may be displayed in non-Cartesian coordinate systems.

### Extensions of Relational and Object oriented Database Systems

In this approach a relational or object-oriented database system is extended to support SGML/XML data management. The proposed SGML extensions included, for example, a system where SGML files were mapped to the O2 database management system, and the extension of operators of SQL to accommodate structured text. All current commercial database systems provide some XML support. Examples of commercial systems are Oracle’s XML SQL Utility and IBM’s DB2 XML Extender. For the sake of discussion, we consider IBM’s DB2 XML Extender as representative of the many systems following this approach.

Data model: When conventional database systems are used for XML, data structuring is systematic and explicitly defined by a database schema. The data model of the original system is typically extended to encompass XML data, but the extensions define simplified tree models rather than rich XML documents.The XML extensions are intended primarily to support the management of enterprise data, wrapped as elements and attributes in an XML document. A problem in using the systems is the need for parallel understanding of two different kinds of data models.

Data definition: The extended systems require explicit definition of transformation of a DTD to the internal structures. XML elements are typically mapped to objects in object-oriented systems, but relational systems require more elaborate transformations to represent hierarchic and ordered structures in unordered tables. In the DB2 XML Extender the whole document can be stored either externally as a file or as a whole in a column of a table. Elements and attributes can also be stored separately inside tables, which can be accessed independently or used for selecting whole documents (as if the side tables were indexes). DTDs, which are stored in a special table, can be associated with XML documents and used to validate them.

Data manipulation: In relational extensions, whole documents and DTDs that are stored in tables can be accessed and manipulated through the SQL database language. As explained above, specific elements of XML data can be extracted when documents are loaded, maintained separately, and accessed directly through SQL. Support for accessing elements that have not been extracted as part of document loading is provided through limited XPath queries, and the DB2 XML Extender can be used together with DB2 UDB Text for full-text search. DB2 also provides document assembly via a function call that can be embedded in an SQL query.

### Analysis of the index data model from a databases perspective

The logical representation of indexes is an abstraction for their actual physical implementation(e.g. inverted indexes, suffix trees, suffix arrays or signature files). This abstraction resembles the data independence principle exploited by databases and, by further investigation, it appears clear how databases and search engine indexes have some similarities in the nature of their data structures: in the relational model we refer to a table as a collection of rows having a uniform structure and intended meaning; a table is composed by a set of columns, called attributes, having values taken from a set of domains (like integers, string or boolean values). Likewise, in the index data model, we refer to an index as a collection of documents of a given (possibly generic) type having uniform structure and intended meaning where a document is composed of a (possibly unitary) set of fields having values also belonging to different domains (string, date, integer etc).

Differently from the databases, though, search engine indexes do not have functional dependencies nor inclusion dependencies defined for their fields, except for an implied key dependency used to uniquely identify documents into an index. Moreover, it is not possible to define join dependencies between fields belonging to different indexes. Another difference enlarging the gap between the database data model and the index data model is the lack of standard data definition and data manipulation languages. For example both in literature and in industry there is no standard query language convention (such as SQL for databases) for search engines; this heterogeneity is mainly due to a high dependency of the adopted query convention to the structure and to the nature of the items in the indexed collection.

In its simplest form, for a collection of items with textual representation, a query is composed of keywords and the items retrieved contain these keywords. An extension of this simple querying mechanism is the case of a collection of structured text documents, where the use of index fields allows users to search not only in the whole document but also in its specific attributes. From a database model perspective, though, just selection and projection operators are available: users can specify keyword-based queries over fields belonging to the document structure and, possibly, only a subset of all the fields in the document can be shown as result.