Parsing and representing XML and HTML documents with SWI-Prolog

The core of the Web is formed by document standards and exchange protocols. Here we describe tree-structured documents transferred as SGML or XML. HTML, an SGML application, is the most commonly used document format on the Web. HTML represents documents as a tree using a fixed set of elements (tags), where the SGML DTD (Document Type Declaration) puts constraints on how elements can be nested. Each node in the hierarchy has a name (the element-name), a set of name-value pairs known as its attributes and content, a sequence of sub-elements and text (data).

XML is a rationalisation of SGML using the same tree-model, but removing many rarely used features as well as abbreviations that were introduced in SGML to make the markup easier to type and read by humans. XML documents are used to represent text using custom application-oriented tags as well as a serialization format for arbitrary data exchange between computers. XHTML is HTML based on XML rather than SGML. A stable Prolog term-representation for SGML/XML trees plays a similar role as the DOM (Document Object Model ) representation in use in the object-oriented world.

Some issues have been subject to debate.

  1. Representation of text by a Prolog atom is biased by the use of SWI-Prolog which has no length-limit on atoms and atoms that can represent Unicode text. At the same time SWI-Prolog stacks are limited to128 MB each. Using atoms only the structure of the tree is represented on the stack, while the bulk of the data is stored on the unlimited heap. Using lists of character codes is another possibility adopted by both PiLLoW and ECLiPSe. Two observations make lists less attractive: lists use two cells per character while practical experience shows text is frequently processed as a unit only. For (HTML) text-documents we profit from the compact representation of atoms. For XML documents representing serialized data-structures we profit from frequent repetition of the same value.
  2. Attribute values of multi-value attributes (e.g. NAMES) are returned as a Prolog list. This implies the DTD must be available to get unambiguous results. With SGML this is always true, but not with XML.
  3. Optionally attribute values of type NUMBER or NUMBERS are mapped to Prolog numbers. In addition to the DTD issues mentioned above, this conversion also suffers from possible loss of information. Leading zeros and different floating point number notations used are lost after conversion. Prolog systems with bounded arithmetic may also not be able to represent all values. Still, automatic conversion is useful in many applications, especially involving serialized data-structures.
  4. Attribute values are represented as Name=Value. Using Name(Value) is an alternative. The Name=Value representation was chosen for its similarity to the SGML notation and because it avoids the need for univ (=..) for processing argument-lists.

Implementation : The SWI-Prolog SGML/XML parser is implemented as a C-library that has been built from scratch to reach at a lightweight parser. Total sourceis 11,835 lines. The parser provides two interfaces. Most natural to Prolog is load structure(+Src, -DOM, +Options) which parses a Prolog stream into a term as described above. Alternatively, sgml_parse/2 provides an event-based parser making call-backs on Prolog for the SGML events. The call-back mode can deal with unbounded documents in streaming mode. It can be mixed with the term-creation mode, where the handler for begin calls the parser to create a term-representation for the content of the element. This feature is used to process long files with a repetitive record structure in limited memory.

Tags : , , , , , , , , , ,

If you enjoyed this post, please consider to leave a comment or subscribe to the feed and get future articles delivered to your feed reader.

Leave Comment