Parsing XML, Part 1

By: Eric Bohlman

Language Meets Parser

When we talk about parsing a language, we mean the process of taking a piece of code or data written in that language and breaking it up into its constituent parts as defined by the rules of that language. Parsing is an essential task for any program that wants to use language- based data or code as input. In our last column, we talked about how "newline-separated records, with fields separated by commas" was actually the specification for a very simple language. In Perl, we'd normally parse it using something like:

while (<INPUT>) {
  @fields=split /,/;
  # do something with the fields
}

The above code works by imposing structure on the stream of characters contained in the file we're reading. In this case, the structure is mostly implicit; the information that the first field of each record is, say, someone's last name and the second field is that person's first name is not explicitly encoded within the file. It's up to our program to know which is which.

How Not to Parse XML

Beginning and even intermediate Perl programmers, when given a data format like HTML or an XML application, immediately think "let's parse it by reading it in line-by-line and using regular expressions to match tags." This is not a good way to parse a structured markup language! First of all, line boundaries carry no structural meaning in most SGML/XML applications; they exist merely to make editing convenient. The structures you're interested in will almost certainly span line boundaries.

Second, anyone who's read the Perl FAQs (that's all of you, right?) knows that Perl regular expressions, while more powerful than strict "mathematical" regular expressions, can't deal with arbitrarily nested parentheses. "What does this have to do with XML?" you ask. Well, it turns out that XML (or SGML) tags are merely parentheses in drag. Parsing XML is mathematically equivalent to parsing nested parenthesized lists, which requires the capabilities of a push-down automaton (a finite-state machine with a stack) rather than just a finite-state machine.

Third, the XML Recommendation specifies an inclusion mechanism that allows a logical XML document to be made up of several separate physical parts, known as entities (beginning HTML authors are always asking whether there's a tag that lets you include common header or footer information inside a document. SGML, upon which HTML is based, provides the same entity-inclusion mechanism, but unfortunately very few HTML user agents implement it). Nobody wants XML applications to suffer from the same problems of incomplete implementation. As it turns out, handling entity inclusion requires lots of simple but tedious code.

The Right Way to Parse XML

All these considerations mean that writing an XML parser is a demanding task not for the faint of heart. One of the W3C's original design goals for XML was that someone with a degree in computer science could write an XML parser in a week. That turned out to be one of the goals that wasn't quite achieved, and the development of new XML facilities like namespaces makes it even less realistic.

The complexity of XML parser design has not prevented many talented individuals and organizations from writing XML parsers, many of which are available for free. Thus, nearly everyone who wants to write programs that use XML for input will want to use a pre-written parser to do all the dirty work. In the Perl world, this means using parsers that have been packaged as Perl modules.

Types of parsers

All XML parsers break an XML document down into the following components:

XML parsers can be classified along two independent dimensions: validating vs. non-validating and stream-based vs. tree-based.

Validating and non-validating parsers

In our last column, we mentioned that SGML parsers require a special grammar, called a Document Type Definition (DTD) in order to parse a document. The DTD lists the allowable element types used in the document, the allowable attributes, if any, for each element, and describes the way elements can be nested (for example, the DTD for HTML specifies that a <td> element can occur only inside a <tr> element).

Due to the design of XML, it's possible to parse a well-formed XML document without referring to a DTD. The requirements for XML to be well-formed are listed in the XML Recommendation; among them are that all elements have both start and end tags, that elements nest rather than overlap, and that each document have only one top-level element.

While DTDs are not required for XML documents, a validating parser can use a DTD to verify that an XML document is properly constructed according to the set of rules for the XML application it's an instance of, and is supposed to complain loudly if the rules aren't followed. This can be quite important when you're processing XML documents that you've received from the Outside World; if vendors were sending XML- encoded invoices to your company, you'd want to ensure that they contained the right elements in the right order. A DTD can also specify default values for the attributes of various elements, and a validating parser can automatically "fill them in" when it encounters elements with no attributes listed.

A non-validating parser, on the other hand, only requires that the document it's parsing be well-formed. Non-validating parser are much simpler than validating parsers, and most of the free parsers are non- validating. Non-validating parsers are usually perfectly adequate for processing XML documents generated within the same organization, or documents whose validity constraints are complex enough that they can't be expressed by a DTD and have to be verified by application logic. As of this writing, all the non-experimental Perl XML parsers are non- validating.

Stream-based and tree-based parsers

There are basically two ways a parser can make the components of an XML document known to an application: it can read through the document and signal the application every time a new component appears, or it can read the entire document and then present the application with a tree structure corresponding to the element structure of the document. A parser that works the first way is called a stream-based or event- driven parser; one that works the second way is called a tree- based parser. Two common terms that you'll hear are SAX and DOM; SAX (Simple API for XML) is a specification (developed informally by members of the xml-dev mailing list) for how a stream-based parser should "talk" to an application; DOM (Document Object Model) is a specification (a formal Recommendation of the W3C) for how an application can access and manipulate the tree structure of a document.

Whether to use a stream-based or tree-based parser depends on the nature of the processing being done to the XML documents and the size of the documents. A tree-based parser usually has to load the entire document into memory, which may be impractical when processing documents like dictionaries or large database dumps. With a stream-based processor, you can skip over elements that you aren't interested in (for example, when looking up a particular word in a dictionary). But if your application needs to process certain elements in relation to other elements (for example, reading a bibliography and extracting a list of all authors who have published at least three articles on the same topic), a tree- based parser is much easier to work with.

It's worth noting that a tree-based parser can be built on top of a stream-based parser, and that the output of a tree-based parser can be "walked" to provide a stream-based interface to an application. As of this writing, all the Perl tree-based parsers are of the former type.

Perl Modules for Parsing XML

As of this writing, there are four "mainstream families" of XML parsers available as Perl modules, all of which are available from CPAN:

XML::Parser

This is the granddaddy of all Perl XML parsers. It makes use of James Clark's C-based Expat parser, and was originally written by none other than Larry Wall! Clark Cooper now maintains it.

XML::Parser is fundamentally a stream-based, non-validating parser; it provides a "native" interface based on associating callback subroutines with events in the document, as well as several extended "styles," two of which build tree structures.

XML::DOM

This is a tree-based parser implementing the W3C's DOM Level 1 interface, written by Enno Derksen. It's actually built on top of XML::Parser.

XML::Parser::PerlSAX

This is a relatively new stream-based parser implementing the SAX interface, written by Ken MacLeod. Because of SAX's Java heritage, the interface isn't quite as "Perlish" as XML::Parser.

XML::Grove

This is a tree-based parser, also written by Ken MacLeod. While XML::DOM requires you to use method calls to walk the trees it builds, XML::Grove creates trees that can be traversed using standard Perl array operations. This makes it faster for applications that have to do lots of traversal (method calls in Perl are rather expensive).

All of these modules provide object-oriented interfaces; if you're not comfortable with object-oriented programming in Perl, now is the time to review the perltoot and perltootc manpages that come with Perl. If you have ActiveState's ActivePerl for Win32, you already have XML::Parser installed, since ActivePerl's PPM utility uses XML documents to store the installation requirements for modules. In a later column we'll cover this use of XML in greater depth.

And Now For a Real Application Example

Enough talk about what's out there, now let's start using it! One of the possible uses of XML, but by no means the only one, is to describe the structure of a document that's intended to be rendered for human consumption, without tying it down to a specific mode of rendering. For this example, we'll design an extremely simply XML-based markup language for describing outlines, and then we'll write a (surprisingly short) program to take a document and either render it with appropriate numbering and indentation, or display just the table of contents.

A markup language for outlines

Conceptually, an outline consists of one or more sections, each of which consists of a sequence of paragraphs and nested sections. Each section has a title, and each paragraph is of course made up of characters. Our first task is to translate these requirements into XML terms. Clearly an outline should correspond to an element, in fact the top-level element in our language. This means that a document is going to look like:

<outline>
<!-- some stuff inside -->

Each section also needs to be an element, which we'll call <sect>..</sect>. We could make the section's title another element, which would nest inside <sect>, but in this case we'll make it an attribute of <sect>. Finally, we'll mark up paragraphs with a <para> element. So a complete outline document would look like:

<outline>
  <sect title="first">
    <para>This is the first section</para>
    <sect title="first subsection">
      <para>This is the first subsection
of the first section
      </para>
      <sect title="first sub-subsection">
      <para>This is what the title says it is</para>
      <para>And this is the second paragraph</para>
      </sect>
      <para>This is more text for the first
subsection.  Doesn't this look good?
      </para>
    </sect>
    <sect title="second subsection">
    <para>This is the text of the second
subsection of the first section
    </para>
    </sect>
  </sect>
  <sect title="second">
    <para>This very simple program was designed
to demonstrate stream-based parsing of XML documents.
It reads a file written according to a very simple XML
vocabulary and outputs a text file with appropriately
numbered and indented sections.
    </para>
  </sect>
</outline>

A program for rendering outlines

Now, we want to write a program that can take a document like the above example and render it as:

1 first

This is the first section

    1.1 first subsection

    This is the first subsection
    of the first section

        1.1.1 first sub-subsection

        This is what the title says it is

        And this is the second paragraph

    This is more text for the first
    subsection.  Doesn't this look good?

    1.2 second subsection

    This is the text of the second
    subsection of the first section

2 second

This very simple program was designed to demonstrate stream-based parsing
of XML documents. It reads a file written according to a very simple XML
vocabulary and outputs a text file with appropriately numbered and indented
sections.

or

1 first
    1.1 first subsection
        1.1.1 first sub-subsection
    1.2 second subsection
2 second

Pay close attention to the fact that nowhere in the document do we specify how the sections and subsections are to be numbered or indented; we simply indicate where they are. This illustrates the concept of separating content from presentation, which is a major benefit of using structured markup languages.

Because this kind of processing doesn't require random access to the document elements, we'll use the stream-style XML::Parser:

#!/usr/bin/perl -w
use strict;

You always use -w and strict, don't you?

use XML::Parser;
use Text::Wrap;
use Getopt::Std;

my ($indlevel,@sectnums,$parabuf,%opts);

$indlevel will hold our nesting level, @sectnums will be an array of section numbers, and $parabuf will hold the text of a paragraph.

getopts('c',\%opts);

If the 'c' option is specified on the command line, we'll just print a table of contents

die "usage: psect [-c] file\n" unless @ARGV==1;
my $p=new XML::Parser(Style=>'Stream');

We create a new parser object, and we specify that we want the "Stream" callback style. This is described in detail in the POD for XML::Parser; the main reason we're using it here instead of the default "handler" interface is that the Stream style makes a single callback for a run of characters, which simplifies buffering.

$p->parsefile($ARGV[0]);

Now we give the parser object a file and tell it to do its thing. All the real work will take place in the callbacks

sub StartTag {
  my ($expat,$eltype)=@_;

When using the 'Stream' style, the parser calls the sub 'StartTag' whenever it sees the start tag for an element. The first argument is a reference to the parser object, and the second is a string containing the element type name. %_ is set to the attribute-value pairs, if any, and $_ contains the complete start tag including the angle brackets.

  if ($eltype eq "outline") {
    $indlevel=-1;
    $sectnums[0]=0;
    $parabuf="";

We do some initialization (some of which isn't really necessary) when we see the start of the top-level element. We could also have done it in a 'StartDocument' sub which, if it exists, gets called when the parser begins processing the document.

  } elsif ($eltype eq "sect") {
    ++$sectnums[++$indlevel];
    $sectnums[$indlevel+1]=0;
    print ' ' x (4*$indlevel),join('.',@sectnums[0..$indlevel]), \
    " $_{title}\n";
    print "\n" unless $opts{c};

When we see a <sect> tag, we bump the nesting level, increment the (sub)section number at our current nesting level, and zero out the subsection number for any sub-sections. Then we grab the section's title from the title attribute and print it out with the appropriate indentation and numbering

  } elsif ($eltype eq "para") {

In this example, we don't have to do anything at the start of a paragraph; in a more sophisticated program, we might have had to do some initialization.

  } else {
    die "invalid element: $eltype";
  }

Since we're using a non-validating parser, we have to do at least a little checking to make sure that we're only using elements defined in the language.

}

sub Text {
  tr/\n/ /;
  s/^\s+//;
  s/\s+$//;
  return if $_ eq "";
  $parabuf=$_;
}

The parser calls the 'Text' sub when it sees a run of characters. $_ contains the text up to the next start tag, end tag, processing instruction, or comment. An XML parser is required to pass all whitespace in the document on to the application, so we remove leading and trailing whitespace and ignore any all-whitespace text, which in this case comes from the whitespace used to make the document easier to edit. We translate any newlines into single spaces because we're going to reformat the paragraphs.

Note that we're making the assumption, just to simplify things, that the only text we're interested in occurs inside <para> elements. This program will silently ignore non-whitespace text outside <para>s.

sub EndTag {
  my ($expat,$eltype)=@_;
  if ($eltype eq "outline") {
  } elsif ($eltype eq "sect") {
    --$indlevel;
  } elsif ($eltype eq "para") {
    my $ind=' ' x (4*$indlevel);
    print wrap($ind,$ind,$parabuf),"\n\n" unless $opts{c};
  }

We do half our work when we encounter the end tags. We wrap and print our paragraphs when we see the end tag for a <para>. This is fairly typical for a stream-based application; we can't do much with an element until we've seen the end of it.

}

Wrapping it up

Well, that's it for this month's column. Next month we'll continue talking about XML parsing and we'll look at tree-based parsing.

You can download the code for the psect program and the sample outline and play around with them. Remember that in XML, element names are case-sensitive; <sect> and <SECT> are the start tags for two different elements, the latter of which isn't defined in our example markup language.