Parsing XML, Part 2

By: Eric Bohlman

Tightening Up some Loose Processing

In last month's column, I presented a simple program, psect, for converting outlines written in a simple XML markup language into printable text. I described the language thusly:

Conceptually, an outline consists of one or more sections, each of which consists of a sequence of paragraphs and nested sections. Each section has a title, and each paragraph is of course made up of characters.

Some of you may have noticed (in fact I know for certain that at least one reader did, because he emailed me to chide me for it!) that the program did not complain if given input that was well-formed XML, but didn't conform to the informal description above. For example, if you wrote an outline that included loose text outside a <para> element, psect would simply ignore the text. The reason for not including much, if any, "application-level validation" was, as you might have guessed, that I didn't want to overwhelm my readers with too much detail at once. But application-level validation is an essential part of XML processing, and now is the time to introduce it.

Formalizing structure

In previous columns I mentioned that it's possible to write a formal grammar, called a Document Type Definition (DTD) to describe the allowable element structure in an XML markup language, and that a validating parser can make use of it. While we're not using a validating parser, writing a DTD is still a valuable exercise in documenting the structure of the language and providing a guide that we can use when building application-level validation into our program.

An XML DTD is built up of markup declarations which are delimited by <! and >. There are several types of markup declarations allowed in XML (and even more in SGML), but for now we'll only be concerned with element type declarations and attribute list declarations.

An element type declaration gives the name of an element and its content model, which specifies what the element can contain. The content model is in some ways similar to a Perl regular expression. Let's create an element type declaration for our <outline> element:

<!ELEMENT outline (sect+)>

This formally states that <outline> is an element, and that it can contain one or more <sect> elements; nothing more and nothing less. The '+' means exactly the same thing as it does in a Perl regular expression, as do '*' and '?'; if we had written

<!ELEMENT outline (sect*)>

we would be saying that an <outline> could have zero or more <sect>'s. In fact, we'll change our informal spec to allow that so we don't have to write code to give a "<outline> was empty" error message.

Next we'll declare our <sect> and <para> elements:

<ELEMENT sect (para | sect)*>
<ELEMENT para (#PCDATA)>

The vertical bar has its familiar meaning of alternation. '#PCDATA' is a special term meaning "characters only, no markup." It actually stands for "parsed character data," "parsed" meaning that it's examined for the occurrence of markup, which terminates the run of text. Because of this, you must entify any occurrences of '<' or '&' that occur in your text by using the pre-defined character entities &lt; and &amp; in order to keep them from being taken as markup.

The content models for <outline> and <sect> specify what's called element content; those elements can contain only other elements, but not "loose" text. The content model for <para>, on the other hand, is called mixed content; text is allowed and it would have been possible to write the model so that a mixture of text and markup was allowed. In XML, mixed content models that allow both text and markup must take the form of zero or more occurrences of an alternation whose first alternative is #PCDATA. SGML allows a wider range of mixed-content models, but experience has shown that using anything but the restricted XML form leads to unpleasant surprises involving the interpretation of whitespace.

Speaking of whitespace, whitespace is exempt from the rule that text can't occur inside element content. Otherwise it wouldn't be possible to "prettyprint" an XML document, or even to compose one by hand in many cases. An XML parser is not allowed to ignore such whitespace; it must pass it to the application just as it would if it had occurred in mixed content. However, a validating parser is required to inform the application if a sequence of whitespace characters occurs within element content; such whitespace is commonly called "ignorable whitespace." A text-formatting application would probably choose to ignore this kind of whitespace, whereas an editor would probably want to keep track of it.

Finally, we need to declare the allowable attributes for <sect>:

<ATTLIST sect
  title CDATA #IMPLIED
>

An attribute list declaration can specify more than one attribute for an element; in our case we have only one, title. 'CDATA' stands for "character data" and in this context means plain text (which cannot contain an unentified '<'; this is rather confusing, since in SGML and in the XML construct known as a CDATA section, CDATA means "not parsed for markup"). '#IMPLIED' means that the attribute need not be supplied with the element and that the application will supply a default value. If we had instead written '#REQUIRED' it would be an error to leave out the attribute. We could also have written some ordinary text in place of '#IMPLIED'; in that case the attribute would have been optional, with the text becoming the default value if it wasn't supplied.

There's a great deal more to XML DTDs than can be covered here. You can find out more about them by reading the XML Recommendation or one of the many XML tutorials on the Web. One thing to keep in mind is that in the future, DTDs are not expected to be the only or even the primary method for specifying the structure of XML-based markup languages; there are currently several alternative proposals for an XML-based schema language for XML documents and data. These proposed languages allow the imposition of constraints beyond what DTDs can impose (for example, requiring that the text content of an element be a valid date) and allow one schema to inherit from another (something that can be done with DTDs but is rather clumsy and requires very careful planning).

Enforcing structure

If the revised version of psect is going to enforce the structure we just formally described, it's going to have to check the context in which elements and characters appear. How, you ask, can you do that with a stream-based parser? The answer is that XML::Parser provides some limited facilities for examining context. In particular, it lets you find out what elements the current element is nested inside. As you know, the first argument that XML::Parser passes to a callback function is a reference to a parser object. This object is actually of the type XML::Parser::Expat rather than of XML::Parser itself, and a perusal of the POD for XML::Parser::Expat shows several methods for examining and testing context. The in_element and within_element methods are especially useful for context validation.

Here's psect1, the revised version of psect with error checking:

#!/usr/bin/perl -w
use strict;

use XML::Parser;
use Text::Wrap;
use Getopt::Std;

my ($indlevel,@sectnums,$parabuf,%opts);

getopts('c',\%opts);
die "usage: psect [-c] file\n" unless @ARGV==1;
my $p=new XML::Parser(Style=>'Stream',ErrorContext=>2);

The ErrorContext argument tells the parser to show the specified number of lines before and after the place where an error occurs.

$p->parsefile($ARGV[0]);

sub StartTag {
  my ($expat,$eltype)=@_;
  if ($eltype eq "outline") {
    $expat->xpcarp('<outline> can\'t occur here') if $expat->depth();

The xpcarp method lets us issue a warning message displayed in context. The depth method tells us how many elements deep we're nested; the nesting level doesn't get incremented until after we've been called to process the start tag, so this test rejects any <outline> elements other than the single top-level one.

In a real-life application, we wouldn't continue to plow ahead and repeat our initialization like this.

    $indlevel=-1;
    $sectnums[0]=0;
    $parabuf="";
    return;
  }
  $expat->xpcroak('<outline> must be the top-level element') 
     unless $expat->within_element('outline');   

If our document isn't enclosed in an <outline> element, we might as well give up (since otherwise we'd have to give this message for every subsequent element). xpcroak is like xpcarp except that it dies instead of just warning. within_element checks to see if the specified element exists anywhere in the list of currently nested elements.

  if ($eltype eq "sect") {
    if (not ($expat->in_element('outline') or $expat->in_element('sect'))) {
      $expat->xpcarp('<sect> can\'t occur here');
    }

in_element tests whether the specified element is the immediately enclosing element. This test basically complains about any <sect>s inside <para>s.

    ++$sectnums[++$indlevel];
    $sectnums[$indlevel+1]=0;
    print ' ' x (4*$indlevel),join('.',@sectnums[0..$indlevel])," $_{title}\n";
    print "\n" unless $opts{c};
  } elsif ($eltype eq "para") {
      $expat->xpcarp('<para> can\'t occur here') 
	unless $expat->in_element('sect');
  } else {
    $expat->xpcarp("invalid element: <$eltype>");
  }
}

sub Text {
  my $expat=shift;
  tr/\n/ /;
  s/^\s+//;
  s/\s+$//;
  return if $_ eq "";
  $expat->xpcarp('text not allowed outside <para>') unless $expat->in_element('para');

Note that we can't do the test until after we skip over pure- whitespace text; otherwise we'd be complaining about ignorable whitespace.

  $parabuf=$_;
}

sub EndTag {
  my ($expat,$eltype)=@_;
  if ($eltype eq "outline") {
  } elsif ($eltype eq "sect") {
    --$indlevel;
  } elsif ($eltype eq "para") {
    my $ind=' ' x (4*$indlevel);
    print wrap($ind,$ind,$parabuf),"\n\n" unless $opts{c};
  }
}

We didn't have to do any checking when processing the end tags because the parser itself would have complained that the document wasn't well- formed if they were out of sequence. As mentioned last month, a typical stream-based application will do mostly validation and initialization in response to start tags, buffering in response to characters, and the actual work in response to end tags.

Tree-Based Parsing

Last month I talked about several modules designed specifically for tree-based parsing, and mentioned in passing that XML::Parser provided two tree-based parsing styles. The 'Tree' style causes the parser to construct a very simple tree structure whose nodes consist of references to Perl arrays. There are two types of nodes:

As you can see, all the arrays except the one for the root node have an odd number of elements. When using the 'Tree' style, XML::Parser's parse and parsefile methods return a reference to the root node.

Although the documentation for the 'Tree' style in recent versions of XML::Parser is much clearer than the early versions, understanding the data structures involved isn't the easiest thing in the world. Don't feel bad if you didn't fully understand the description above; after looking at the sample code below, re-reading the documentation, looking at the code again, and repeating the process several times, you'll eventually have an 'AHA!' moment.

psect Revisted: multiple passes with a tree

One reason for using a tree-based parser rather than a stream-based one is that you have to make multiple passes over the document's contents. Let's change the specification for psect so that instead of printing either a table of contents or a full outline, it prints both: a TOC followed by an outline. Now we could do this with a stream-based parser by simply gathering up the output in two different buffers and then printing them at the end, but that wouldn't be any fun. Instead, we'll build a tree and then make two passes over it, printing out the TOC (and validating structure) on the first pass, and printing out the full outline on the second pass. Here's psect2:

#!/usr/bin/perl -w
use strict;

use XML::Parser;
use Text::Wrap;

my ($indlevel,@sectnums);

die "usage: psect file\n" unless @ARGV==1;
my $parser=new XML::Parser(Style=>'Tree',ErrorContext=>2);
my $tree=$parser->parsefile($ARGV[0]);
die '<outline> isn\'t the top-level element' unless $tree->[0] eq 'outline';

If everything went OK, $tree is a reference to a 2-member array whose first member is 'outline' (the name of our root element) and whose second member is a reference to an array containing its contents.

$indlevel=-1;
$sectnums[0]=0;
for my $pass (0..1) {
  my $p=$tree->[1];
  for (my $i=1;$i<@$p;$i+=2) {
    if (not $p->[$i]) {
      warn 'text not allowed here' unless $pass or $p->[$i+1]=~/^\s*$/;
      next;
    }
    warn "<$p->[$i]> not allowed in <outline>" unless $pass or $p->[$i] eq 'sect';
    process_sect($p->[$i+1],$pass);
  }
}

We skip over the attribute hash for <outline> (since it's not supposed to have any attributes and we're going to ignore any spurious ones). We walk over the remaining members in pairs; if the document was valid, they'll consist either of empty text nodes, which we skip, or <sect> nodes, whose contents we pass to process_sect; we need to use a sub here because a <sect> element can contain nested <sect>s, and the easiest way to handle this is with recursion.

sub process_sect {
  my ($p,$pass)=@_;
  my ($href,$ind);
  if ($pass==0)
   {++$sectnums[++$indlevel];
    $sectnums[$indlevel+1]=0;
    $href=join('.',@sectnums[0..$indlevel]);
    $ind=$indlevel;
    $p->[0]{_href}=$href;
    $p->[0]{_ind}=$ind;
    print ' ' x (4*$ind),$href," $p->[0]{title}\n";
   }

On the first pass, we compute the section number and indentation level and then store them as extra attributes of the <sect> element node; its attributes are found in the hash referenced by the first member of the content array. Dynamically updating a parse tree like this is a powerful technique; it reduces the number of data structures you need to create and keep track of. We then print the number and title for the TOC.

  else {
    $href=$p->[0]{_href};
    $ind=$p->[0]{_ind};
    print ' ' x (4*$ind),$href," $p->[0]{title}\n\n";
  }

On the second pass, we retrieve the information we squirreled away on the first pass, and use it to print the section header for the main outline.

  for (my $i=1;$i<@$p;$i+=2) {
    if (not $p->[$i]) {
      warn 'text not allowed outside <para>' unless $pass or $p->[$i+1]=~/^\s*$/;
      next;
    }
    if ($p->[$i] eq 'sect') {
      process_sect($p->[$i+1],$pass);

We now loop through the children of our <sect>. If we find another <sect> we process it recursively.

    } elsif ($p->[$i] eq 'para') {
      my $p1=$p->[$i+1];
      if ($p1->[1]) {
        warn "<$p1->[1] not allowed in <para>" unless $pass;
      }

The first member of the content node for a <para> element is an attribute hash which we ignore. The next two members should be a text node (to keep the code short, we aren't bother to make sure that the text node is the only child node.

      else {
        if ($pass) {
          $_=$p1->[2];
          tr/\n/ /;
          s/^\s+//;
          s/\s+$//;
          my $indent=' ' x (4*$ind);
          print wrap($indent,$indent,$_),"\n\n";
        }

If this is the second pass, we retrieve the text and print it out just like we did in the streaming version, except that the printing and the stripping of leading/trailing whitespace are now done in the same place.

      }
    } else {
      warn "<$p->[$i]> not allowed in <sect>";
    }

Oops, we ran into something that wasn't a <para>, a <sect>, or ignorable whitespace.

  }
  --$indlevel;
}

An exercise for the reader

Modify psect2 to create an HTML document in which the TOC entries are internal links to the corresponding sections.

The 'Objects' style and various kinds of fruit

The other XML::Parser style that creates a tree structure is 'Objects.' When you use it, the parser creates a tree of Perl objects in a package whose name you specify, with classes named after the element names in your document. Each object includes a field called 'Kids' that contains a reference to any child objects. Text nodes are stored as objects in the Characters class.

Here's an amusing little program that interprets an XML-encoded decision tree and asks you questions to identify a fruit. I'm leaving it as an exercise to the reader to figure out how it works and how the markup language is structured (you may find it helpful to read the source code to the 'Objects' section of XML::Parser). Feel free to play around with it and create your own decision trees. I wrote it a while back to teach myself how to use the 'Objects' style. It's a good example of how XML can be used for purposes other than storing documents to be rendered.

#!/usr/bin/perl -w
use strict;
use XML::Parser;

my $text;
{local $/;
 $text=<DATA>;
}
my $p1=new XML::Parser(Style=>'Objects',Pkg=>'DTree');
my $tree=$p1->parse($text);
my $res=$tree->[0]->process();
if (defined $res) {
  print "The result is: $res\n";
} else {
  print "No result was found\n";
}

package DTree;

# return index of next element that's not all whitespace

sub skipws {
  my ($p,$i)=@_;
  ++$i while $i<@$p && $p->[$i]->isa('DTree::Characters')
             && $p->[$i]{Text} =~ /^\s*$/;
  return $i;
}

package DTree::dtree;

sub process {
  my $self=shift;
  my $p=$self->{Kids};
  my $i=DTree::skipws($p,0);
  my $q=$p->[$i];
  die "Dtree doesn't start with question" unless $q->isa('DTree::question');

# build list of available choices
  my $i1=$i;
  my @qvs;
  while (($i=DTree::skipws($p,$i+1)) < @$p) {
    my $c=$p->[$i];
    die "Not a choice" unless $c->isa('DTree::choice');
    push @qvs,$c->{value};
  }

  my $qval=$q->process(@qvs);
  $i=$i1;
  while (($i=DTree::skipws($p,$i+1)) < @$p) {
    my $c=$p->[$i];
    return $c->process if $c->match($qval);
  }
  return undef;
}

package DTree::question;

sub process {
  my ($self,@qvs)=@_;
  my $p=$self->{Kids}[0];
  die "Question doesn't have text" unless $p->isa('DTree::Characters');
  print $p->{Text},' [',join(', ',@qvs),']? ';
  my $qv=<STDIN>;
  chomp $qv;
  return $qv;
}

package DTree::answer;

sub process {
  my $self=shift;
  my $p=$self->{Kids}[0];
  die "Answer doesn't have text" unless $p->isa('DTree::Characters');
  return $p->{Text};
}

package DTree::choice;

sub match {
  my ($self,$qv)=@_;
  return(lc($self->{value}) eq lc($qv));
}

sub process {
  my $self=shift;
  my $p=$self->{Kids};
  my $q=$p->[DTree::skipws($p,0)];
  die "Choice doesn't have answer or subtree" 
      unless $q->isa('DTree::answer') || $q->isa('DTree::dtree');
  return $q->process;
}

__END__
<dtree>
  <question>What's its shape</question>
  <choice value="round">
    <dtree>
       <question>What's its color</question>
       <choice value="green">
         <answer>Lime</answer>
       </choice>
       <choice value="yellow">
          <dtree>
            <question>Is it big</question>
            <choice value="no"> 
              <answer>Lemon</answer>
            </choice>
            <choice value="yes">
              <answer>Grapefruit</answer>
            </choice>
          </dtree>
       </choice>
       <choice value="orange">
         <answer>Orange</answer>
       </choice> 
    </dtree>
  </choice>
  <choice value="elliptical">
    <dtree>
       <question>Is it yellow</question>
       <choice value="yes">
         <answer>Banana</answer>
       </choice>
    </dtree> 
  </choice>
</dtree>

Where We're Going Next

This month's installment pretty much wraps up our basic coverage of XML::Parser. I strongly recommend that you read the POD for XML::Parser and XML::Parser::Expat until it all makes sense to you, and that you try implementing some of your own small XML applications.

Next month we'll be starting, and perhaps finishing, an introduction to XML::DOM. Once we finish that, we'll take a look at some Perl modules that treat XML documents and data at a higher level than just parsing them, and at some modules for creating XML documents. We'll also look at modules for querying and extracting information from XML documents. The XML world is moving so rapidly that I have no idea what we'll be covering after that. Let me know what XML topics you want to see covered here.

Another Valuable XML Resource

Robin Cover's SGML/XML Page is mandatory reading for anyone who wants to keep up with the developments in the XML community.