Posts Tagged ‘SGML

Extensions of Relational and Object oriented Database Systems

In this approach a relational or object-oriented database system is extended to support SGML/XML data management. The proposed SGML extensions included, for example, a system where SGML files were mapped to the O2 database management system, and the extension of operators of SQL to accommodate structured text. All current commercial database systems provide some XML support. Examples of commercial systems are Oracle’s XML SQL Utility and IBM’s DB2 XML Extender. For the sake of discussion, we consider IBM’s DB2 XML Extender as representative of the many systems following this approach.

Data model: When conventional database systems are used for XML, data structuring is systematic and explicitly defined by a database schema. The data model of the original system is typically extended to encompass XML data, but the extensions define simplified tree models rather than rich XML documents.The XML extensions are intended primarily to support the management of enterprise data, wrapped as elements and attributes in an XML document. A problem in using the systems is the need for parallel understanding of two different kinds of data models.

Data definition: The extended systems require explicit definition of transformation of a DTD to the internal structures. XML elements are typically mapped to objects in object-oriented systems, but relational systems require more elaborate transformations to represent hierarchic and ordered structures in unordered tables. In the DB2 XML Extender the whole document can be stored either externally as a file or as a whole in a column of a table. Elements and attributes can also be stored separately inside tables, which can be accessed independently or used for selecting whole documents (as if the side tables were indexes). DTDs, which are stored in a special table, can be associated with XML documents and used to validate them.

Data manipulation: In relational extensions, whole documents and DTDs that are stored in tables can be accessed and manipulated through the SQL database language. As explained above, specific elements of XML data can be extracted when documents are loaded, maintained separately, and accessed directly through SQL. Support for accessing elements that have not been extracted as part of document loading is provided through limited XPath queries, and the DB2 XML Extender can be used together with DB2 UDB Text for full-text search. DB2 also provides document assembly via a function call that can be embedded in an SQL query.

Tags : , , , , , , , , , , , ,

Native SGML/XML Systems

Native SGML/XML systems are designed especially for the management of SGML/XML data. The systems should include capabilities to define, create, store, validate, manipulate, publish,and retrieve SGML/XML documents and their parts. Some of the native systems, such as Astoria and Information Manager, are comprehensive document management systems with front-ends for users to work with documents. Some others, such as SIM and Tamino, are software packages intended for building applications for the management of SGML/XML data. A few systems, especially those that support semi-structured data,such as Lore, XYZFind, and dbXML, provide native support for tree-structured data but are limited in their support of rich XML documents because they do not rely extensively on DTDs or other document type definitions.

The data model: There is no single well-defined data model for XML data. The lack of a well-defined universal conceptual model causes problems in the native systems: for example, the underlying model for XML data is not explicitly defined in Astoria or Tamino, and system-specific notions and models have been invented in SIM. Many of the systems consist of packages of tools that do not share a common data model and may be limited in kind of XML documents they are able to store and manipulate. Unfortunately, because the systems do not highlight the details of the data model, such inconsistencies and constraints are often difficult to detect.

Data definition: The capability to define document types is an important characteristic of XML, and we consider the document type definition capability an essential feature in systems of this category. This aspect severely reduces the utility of semi-structured approaches for managing persistent XML resources. The systems originally developed for SGML are able to use DTDs directly as the document type definition with no translation to some other form of schema. Additional definitions may be needed, however,to support flexible manipulation and efficient implementation. In Astoria an important extension is provided by components, which form the data unit for many operations. For example, access rights are granted at the component level, components can have variants and versions, and simultaneous update to a document by several users is controlled at the component level.

Data manipulation: The lack of a standardized XML query language has led to various system-specific query languages. In addition, the simplified data models restrict query capabilities. For example, since Tamino does not store information about attribute types, queries utilizing IDs and IDREFs are impossible. The response to a Tamino query is an XML document containing the query result as tagged text, plus metadata related to the query ( and time). Thus the query language cannot be applied directly to query results unless a Tamino schema defines them as part of the database. In content management systems such as Astoria and Information Manager, parts of documents can be updated by structure editors integrated with the systems. In both of them style sheets can be associated with documents in their associated editors, and transformations can be defined by means of style sheets. Both of the systems also offer some capabilities for document assembly. In Tamino, database update is applied at the document level. The data storage mechanism for XML data(called X-Machine) has an associated programming language that includes commands for inserting and deleting documents. XSL is used to transform XML documents to HTML for Web publishing,but there is no additional support for defining transformations.

Tags : , , , , , , , ,

Parsing and representing XML and HTML documents with SWI-Prolog

The core of the Web is formed by document standards and exchange protocols. Here we describe tree-structured documents transferred as SGML or XML. HTML, an SGML application, is the most commonly used document format on the Web. HTML represents documents as a tree using a fixed set of elements (tags), where the SGML DTD (Document Type Declaration) puts constraints on how elements can be nested. Each node in the hierarchy has a name (the element-name), a set of name-value pairs known as its attributes and content, a sequence of sub-elements and text (data).

XML is a rationalisation of SGML using the same tree-model, but removing many rarely used features as well as abbreviations that were introduced in SGML to make the markup easier to type and read by humans. XML documents are used to represent text using custom application-oriented tags as well as a serialization format for arbitrary data exchange between computers. XHTML is HTML based on XML rather than SGML. A stable Prolog term-representation for SGML/XML trees plays a similar role as the DOM (Document Object Model ) representation in use in the object-oriented world.

Some issues have been subject to debate.

  1. Representation of text by a Prolog atom is biased by the use of SWI-Prolog which has no length-limit on atoms and atoms that can represent Unicode text. At the same time SWI-Prolog stacks are limited to128 MB each. Using atoms only the structure of the tree is represented on the stack, while the bulk of the data is stored on the unlimited heap. Using lists of character codes is another possibility adopted by both PiLLoW and ECLiPSe. Two observations make lists less attractive: lists use two cells per character while practical experience shows text is frequently processed as a unit only. For (HTML) text-documents we profit from the compact representation of atoms. For XML documents representing serialized data-structures we profit from frequent repetition of the same value.
  2. Attribute values of multi-value attributes (e.g. NAMES) are returned as a Prolog list. This implies the DTD must be available to get unambiguous results. With SGML this is always true, but not with XML.
  3. Optionally attribute values of type NUMBER or NUMBERS are mapped to Prolog numbers. In addition to the DTD issues mentioned above, this conversion also suffers from possible loss of information. Leading zeros and different floating point number notations used are lost after conversion. Prolog systems with bounded arithmetic may also not be able to represent all values. Still, automatic conversion is useful in many applications, especially involving serialized data-structures.
  4. Attribute values are represented as Name=Value. Using Name(Value) is an alternative. The Name=Value representation was chosen for its similarity to the SGML notation and because it avoids the need for univ (=..) for processing argument-lists.

Implementation : The SWI-Prolog SGML/XML parser is implemented as a C-library that has been built from scratch to reach at a lightweight parser. Total sourceis 11,835 lines. The parser provides two interfaces. Most natural to Prolog is load structure(+Src, -DOM, +Options) which parses a Prolog stream into a term as described above. Alternatively, sgml_parse/2 provides an event-based parser making call-backs on Prolog for the SGML events. The call-back mode can deal with unbounded documents in streaming mode. It can be mixed with the term-creation mode, where the handler for begin calls the parser to create a term-representation for the content of the element. This feature is used to process long files with a repetitive record structure in limited memory.

Tags : , , , , , , , , , ,