One of the most popular methods for parsing and accessing the contents of an XML document is the Document Object Model (DOM) defined by the W3C. A DOM parser translates an XML document into an object model that allows a program to randomly access and alter sections of the document. By an object model, we mean that the parser creates an object, called a document object, that represents the document itself; like any object, it has properties, which in this case represent the constituents of the document, and methods for accessing or altering those properties. Each constituent property is itself an object, and can have other objects as its properties. The containment (not inheritance) hierarchy of the object model corresponds to the hierarchical structure of the document.
The DOM was originally conceived as an API for allowing scripts embedded in an HTML document to access and alter the document in order to create "dynamic HTML" effects in a Web browser. The model also proved to be an excellent way to model XML documents, leading the W3C to issue a Recommendation standardizing the DOM interface. Nearly every XML parser provides a DOM interface; in fact, the DOM is the only interface offered by Microsoft's XML-processing tools. The DOM provides both read and write access to documents, and can also be used to create new documents from scratch.
The DOM as defined by the W3C is an abstract set of language-independent interfaces. Implementing the DOM in a specific programming language involves defining a set of language bindings which specify how a program makes use of those interfaces. For example, ECMAScript (JavaScript) has a standardized syntax for referring to an object's properties, whereas Perl doesn't. Therefore, the language bindings for ECMAScript expose properties such as nodeName directly, whereas the bindings for Perl (like the W3C's bindings for Java) use methods such as getNodeName() to return or set properties.
The current W3C DOM Recommendation is for what's called Level 1 of the DOM. The W3C is currently working on the specifications for Level 2, which will add methods for filtering and advanced navigation to the model. The current Perl DOM implementation, Enno Derksen's XML::DOM, supports only Level 1, as do most production DOM implementations. XML::DOM provides a number of facilities that are not present in the W3C recommendation; for simplicity, this article will not cover them.
All the objects in the DOM inherit from (implement the interfaces of) three main classes: Node, NodeList, and NamedNodeMap. A NodeList is conceptually similar to an array of Nodes; it is an ordered list of Nodes accessed by item number. For example, you can obtain a NodeList representing an element's children. A NamedNodeMap is conceptually similar to a hash of nodes; it is an unordered list of Nodes accessed by name. For example, you can obtain a NamedNodeMap representing an element's attributes.
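As a minimal sketch of the difference between the two containers, the following uses XML::DOM on a tiny inline document. The document string and the helper names child_names() and attr_value() are made up for illustration; the DOM calls themselves (getChildNodes, item, getLength, getAttributes, getNamedItem) are the standard Level 1 methods.

```perl
#!/usr/bin/perl -w
use strict;
use XML::DOM;

# A tiny inline document, invented for this sketch.
my $xml  = '<sect title="first" id="s1"><para>one</para><para>two</para></sect>';
my $doc  = XML::DOM::Parser->new->parse($xml);
my $sect = $doc->getDocumentElement;

# NodeList: ordered, accessed by item number starting at zero.
sub child_names {
    my $node = shift;
    my $nl   = $node->getChildNodes;
    return map { $nl->item($_)->getNodeName } 0 .. $nl->getLength - 1;
}

# NamedNodeMap: unordered, accessed by name.
sub attr_value {
    my ($elem, $name) = @_;
    my $attr = $elem->getAttributes->getNamedItem($name);
    return defined $attr ? $attr->getValue : undef;
}

print join(',', child_names($sect)), "\n";
print attr_value($sect, 'title'), "\n";
```

The NodeList preserves document order, while nothing should be assumed about the order in which a NamedNodeMap holds its attributes.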
Each constituent of a document is represented by an object that inherits from Node. A Document node represents the entire document. An Element node represents an element, an Attr node represents a single attribute of an element, and a Text node represents a run of characters. There are also nodes for such things as comments and processing instructions. The DOM makes good use of inheritance and polymorphism; for example, every method defined for Node can be used on an Element object, and a NodeList can contain both Element objects and Text objects. If you're a little (or more than a little) confused by OO concepts, now would be a good time to review them.
A concrete example of a DOM representation will (hopefully!) make things clearer. Let's take the sample outline document we've been working with in the last two columns. A DOM built from this document will consist of a Document node with one child, an Element node whose name is 'outline'. If our sample document had included comments or processing instructions that were outside the <outline> element, the Document node would also have Comment nodes or ProcessingInstruction nodes as children. Due to the rule that a well-formed XML document must have exactly one root element, a Document node can have only one Element child.
The root Element node, in turn, has two Element children, each named 'sect'. The first of these Elements has as one of its properties an Attr node named 'title', which in turn has a Text child whose value is 'first'. It also has a 'para' Element child and two 'sect' Element children.
All Node objects have methods that let you navigate from node to node in the tree represented by the DOM. From a node, you can get its children, parents or siblings. You can also obtain a list (implemented as a NodeList) of Element nodes anywhere in the document that have a particular element-type name (i.e., tag name). This means that with the DOM, you are not restricted to a simple top-down traversal of an element tree like you are with the Tree style of XML::Parser.
As you can see, the DOM provides an extremely flexible and powerful interface to XML documents. Its built-in methods relieve you from writing much of the state-maintenance and bookkeeping code that you need when using a stream-style parser or a simpler tree-style parser. Why, then, would anybody want to use anything other than the DOM to parse XML documents? Maybe the folks in Redmond are on to something?
Well, the catch is that the DOM is an extremely heavyweight way to represent an XML document. A DOM representation of a document takes up many times as much memory as the document itself. Parsing a document into a DOM representation is not terribly fast, and navigating the parsed document involves lots of method calls, which are quite expensive in Perl. It would not, for example, be a Good Idea for a traditional process-per-hit CGI script to build a DOM representation of a document every time it's invoked; that would usually lead to unacceptable response times and server loads.
Nonetheless, there are often times when using the DOM will greatly simplify your application logic. If you have an application where speed is critical, but the document changes infrequently compared to the number of times the application is run (e.g. a CGI script that runs thousands of times a day, accessing a document that's updated once per day), you might want to consider creating a simplified data structure from the information extracted from the parsed document and saving it using something like Storable. Your application would then compare the last-modification times on both the original document and the saved structure file, and would reparse the document only if the saved structure file was out of date. Otherwise it would just load the saved structures; unless you're working with enormous documents, this technique should reduce waiting times and processor loads to an acceptable level.
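A rough sketch of that caching scheme, using only Storable and stat. The helper name load_cached() and the $builder callback are hypothetical; the builder stands in for whatever code parses the document and reduces it to a plain Perl structure. The sketch ignores file locking and assumes the saved file is written only by this routine.

```perl
#!/usr/bin/perl -w
use strict;
use Storable qw(store retrieve);

# Hypothetical helper: return the saved structure for $xmlfile, calling
# $builder (a code ref that parses the file and returns a reference)
# only when the saved copy is missing or older than the document.
sub load_cached {
    my ($xmlfile, $cachefile, $builder) = @_;

    my $xml_mtime = (stat $xmlfile)[9];
    die "can't stat $xmlfile: $!" unless defined $xml_mtime;
    my $cache_mtime = (stat $cachefile)[9];    # undef if no saved copy yet

    # Saved copy exists and is at least as new as the document: reuse it.
    if (defined $cache_mtime and $cache_mtime >= $xml_mtime) {
        return retrieve($cachefile);
    }

    # Otherwise reparse, save the result, and return it.
    my $data = $builder->($xmlfile);
    store($data, $cachefile);
    return $data;
}

# Example use, with a stand-in builder that just records the file size:
# my $data = load_cached('events.xml', 'events.sto',
#                        sub { return { size => -s $_[0] } });
```

Because modification times have one-second resolution on many systems, a document rewritten in the same second as the saved copy would be treated as unchanged; for a document updated once a day that is unlikely to matter.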
As you might have guessed, we're going to rework last month's outliner example to use XML::DOM rather than XML::Parser's Tree style. We're going to be a little sloppy with our error-checking, just to simplify things. Here's psect3:
#!/usr/bin/perl -w
use strict;
use XML::DOM;
use Text::Wrap;

my ($indlevel,@sectnums);
die "usage: psect file\n" unless @ARGV==1;
my $parser=new XML::DOM::Parser;
my $doc=$parser->parsefile($ARGV[0]) or die "unable to parse document";
Since XML::DOM::Parser inherits from XML::Parser, it has the same parse() and parsefile() methods. If the parse was successful, the return value is a reference to a Document node.
my $root=$doc->getFirstChild;
die '<outline> isn\'t the top-level element' unless $root->getTagName eq 'outline';
We want our Document node to have a single child, an Element node of type 'outline'. If we were going to allow comments or processing instructions, we'd need to get a list of children and process the first one that was an Element node.
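As a side sketch (separate from psect3), here is one way the more tolerant version could look. The helper name first_element_child() and the inline test document are invented for illustration. The Level 1 DOM also defines getDocumentElement() on Document nodes, which returns the root element directly and is usually the simpler choice.

```perl
#!/usr/bin/perl -w
use strict;
use XML::DOM;

# Hypothetical helper: return the first Element child of a node,
# skipping comments, processing instructions and anything else.
sub first_element_child {
    my $node = shift;
    my $nl   = $node->getChildNodes;
    for my $i (0 .. $nl->getLength - 1) {
        my $child = $nl->item($i);
        return $child if $child->getNodeType == ELEMENT_NODE;
    }
    return undef;
}

# A document with a comment ahead of the root element.
my $xml = "<?xml version='1.0'?>\n<!-- a leading comment -->\n<outline/>";
my $doc = XML::DOM::Parser->new->parse($xml);

my $root = first_element_child($doc);
die "no root element found" unless defined $root;
print $root->getTagName, "\n";
```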
$indlevel=-1;
$sectnums[0]=0;
for my $pass (0..1) {
    my $nl=$root->getChildNodes;
    for my $i (0..$nl->getLength-1) {
        my $p=$nl->item($i);
We obtain a NodeList containing the children of our root element. The nodes in the list are numbered starting from zero. We access each child using the NodeList's item() method; the getLength method tells us how many nodes are in the list.
For some odd reason, most people who write examples of walking a NodeList feel compelled to use a C-style for loop rather than iterating over a list of numbers. I find the iterate-over-the-list style easier to read, but your mileage may vary.
        if ($p->getNodeType==TEXT_NODE) {
            warn 'text not allowed here' unless $pass or $p->getData=~/^\s*$/;
            next;
        }
getData() is a method specific to Text nodes. In other languages, you'll often see getNodeValue() used instead because it avoids the need for a cast; see the entry for getNodeValue() in the POD for XML::DOM.
        # Skip anything that isn't a <sect> element; complain only on the first pass
        unless ($p->getNodeType==ELEMENT_NODE and $p->getTagName eq 'sect') {
            warn $p->getNodeName.' not allowed in <outline>' unless $pass;
            next;
        }
        process_sect($p,$pass);
We pass the Element node for the <sect> on to process_sect.
    }
}
sub process_sect {
    my ($sectnode,$pass)=@_;
    my ($href,$ind);
    if ($pass==0) {
        ++$sectnums[++$indlevel];
        $sectnums[$indlevel+1]=0;
        $href=join('.',@sectnums[0..$indlevel]);
        $ind=$indlevel;
        $sectnode->setAttribute('href',$href);
        $sectnode->setAttribute('ind',$ind);
        print ' ' x (4*$ind),$href,' ',$sectnode->getAttribute('title'),"\n";
Here we use the setAttribute() and getAttribute() methods of Element. These take and return strings. Element also offers getAttributeNode() and setAttributeNode() methods, which work with Attr nodes. It's easy to confuse the two sets of methods, so don't.
    }
    else {
        $href=$sectnode->getAttribute('href');
        $ind=$sectnode->getAttribute('ind');
        print ' ' x (4*$ind),$href,' ',$sectnode->getAttribute('title'),"\n\n";
    }
    my $children=$sectnode->getChildNodes;
    for my $i (0..$children->getLength-1) {
        my $p=$children->item($i);
        my $t=$p->getNodeType;
        if ($t==TEXT_NODE) {
            warn 'text not allowed outside <para>' unless $pass or $p->getData=~/^\s*$/;
            next;
        }
        # getTagName exists only on Elements, so skip everything else
        unless ($t==ELEMENT_NODE) {
            warn $p->getNodeName.' not allowed in <sect>';
            next;
        }
        if ($p->getTagName eq 'sect') {
            process_sect($p,$pass);
        } elsif ($p->getTagName eq 'para') {
            my $p1=$p->getFirstChild;
getFirstChild() does exactly what its name implies. Here we're blindly assuming that the only child of our <para> is a single text node.
            if ($p1->getNodeType!=TEXT_NODE) {
                warn $p1->getNodeName.' not allowed in <para>' unless $pass;
            }
            else {
                if ($pass) {
                    $_=$p1->getData;
                    tr/\n/ /;
                    s/^\s+//;
                    s/\s+$//;
                    my $indent=' ' x (4*$ind);
                    print wrap($indent,$indent,$_),"\n\n";
                }
            }
        } else {
            warn '<',$p->getTagName,'> not allowed in <sect>';
        }
    }
    --$indlevel unless $pass;    # only the first pass incremented it
}
Since all Nodes implement getFirstChild() and getNextSibling() methods, a loop like
my $children=$sectnode->getChildNodes;
for my $i (0..$children->getLength-1) {
    my $p=$children->item($i);
    # ...
}
can also be written as
for (my $p=$sectnode->getFirstChild;defined $p;$p=$p->getNextSibling) {
    # ...
}
which doesn't make use of any NodeLists.
A frequent beginner's mistake is to call getNodeValue on an Element node and expect to get the characters contained in the element. If you do this, you'll get a return of undef, because an Element node doesn't have a NodeValue; only Text, Attr, CDATASection, Comment, and ProcessingInstruction nodes have one. The text contained in an element is represented as a child of the Element node.
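A small sketch of the right way to get at an element's characters, assuming you only want the text that is an immediate child of the element. The helper name text_of() and the sample document are invented for illustration.

```perl
#!/usr/bin/perl -w
use strict;
use XML::DOM;

# Hypothetical helper: concatenate the data of an element's immediate
# Text children, ignoring any child elements.
sub text_of {
    my $elem = shift;
    my $text = '';
    for (my $c = $elem->getFirstChild; defined $c; $c = $c->getNextSibling) {
        $text .= $c->getData if $c->getNodeType == TEXT_NODE;
    }
    return $text;
}

my $doc  = XML::DOM::Parser->new->parse('<para>Hello, <b>DOM</b> world</para>');
my $para = $doc->getDocumentElement;

# The Element itself has no node value...
print defined $para->getNodeValue ? "has a node value\n" : "no node value\n";
# ...but its Text children do.
print text_of($para), "\n";
```

Note that the text inside the nested <b> element is not included; collecting it as well would require recursing into Element children.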
XML::DOM implements a few non-standard extensions to methods that normally return NodeLists. If a method like getChildNodes is called in list context, it will return an ordinary Perl list of child nodes. Using this feature will speed up the traversal of long lists of nodes, but it's not portable to other languages. Enno also implemented several methods that aren't in the DOM Recommendation; some of them are convenience methods, while others address issues not covered in the Recommendation, such as serializing a generated DOM tree or subtree into XML for storage.
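A brief sketch of the list-context extension next to the portable NodeList form, plus toString(), one of XML::DOM's serialization methods. The sample document is made up for illustration.

```perl
#!/usr/bin/perl -w
use strict;
use XML::DOM;

my $doc  = XML::DOM::Parser->new->parse('<sect><para>a</para><para>b</para></sect>');
my $sect = $doc->getDocumentElement;

# List context: an ordinary Perl list of nodes (XML::DOM extension).
my @kids = $sect->getChildNodes;

# Scalar context: a NodeList, as in the Recommendation.
my $nl = $sect->getChildNodes;

print scalar(@kids), " children via list, ", $nl->getLength, " via NodeList\n";

# Iterating over the list needs no item() calls.
print $_->getTagName, "\n" for @kids;

# Serialize the subtree back to XML (another XML::DOM extension).
print $sect->toString, "\n";
```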
Our previous example didn't really exploit the DOM's random-access capabilities; all it did was a depth-first traversal of the document tree. But the DOM implements a powerful method called getElementsByTagName() which returns a NodeList containing all descendant elements of a particular node that have a specific element type. Once you've got an Element node from this list, you can use Node's standard navigation methods to access its parent, siblings or descendants. This is very useful in filter or query type applications.
Let's say we have an XML document listing a number of events organized by the group sponsoring them. There's a <sponsor> element for each group, containing a <name> element and one or more <event> elements, each of which in turn contains a <name>, a <date>, and a <description>. <name>, <date>, and <description> all contain character values; we're not using any attributes. Note that <name> serves two purposes: specifying the name of a sponsor and specifying the name of an event. An example file looks like:
<events>
  <sponsor>
    <name>The Coalition Against Bubblegum</name>
    <event>
      <name>Rally for BBNPT</name>
      <date>1999-11-01</date>
      <description>
Rally at United Nations Plaza in support of the US signing the Boy-Band
Non-Proliferation Treaty. Food and drink provided.
      </description>
    </event>
    <event>
      <name>Molten Spears art exhibit</name>
      <date>1999-11-20</date>
      <description>
An exhibition of over 100 sculptures made by 30 artists from heat-warped
Britney Spears CDs.
      </description>
    </event>
  </sponsor>
  <sponsor>
    <name>Chicago Perl Mongers</name>
    <event>
      <name>November meeting</name>
      <date>1999-11-15</date>
      <description>Regular monthly meeting</description>
    </event>
  </sponsor>
</events>
We want to enter a date and get a list of the name, description, and sponsor of all the events scheduled on or before that date. One way to do this would be to descend the tree, keeping track of who the sponsor is and what event names we've previously seen, but that's rather clunky. An easier way to do it is to get a list of all the <date> elements in the document, see if they match our criterion, and then navigate up and sideways to get the related information.
levents takes two arguments, the name of an events file and a date in YYYY-MM-DD format, and prints out the date, name, sponsor, and description of each matching event:
#!/usr/bin/perl -w
use strict;
use XML::DOM;
use Text::Wrap;
die "usage: levents file YYYY-MM-DD\n" unless @ARGV==2;
my $parser=new XML::DOM::Parser;
my $doc=$parser->parsefile($ARGV[0]) or die "unable to parse document";
my $search_date=$ARGV[1];
my $dates=$doc->getElementsByTagName('date');
At this point, we have a list of all <date> elements at any level in the document. getElementsByTagName recurses through all descendants of the element it's called on unless you call it with an optional second argument of zero, which will search only immediate children (this option is an extension not included in the Recommendation).
for my $i (0..$dates->getLength-1) {
    my $evdate=$dates->item($i);
    list_event($evdate) if trim($evdate->getFirstChild->getData) le $search_date;
We assume that a <date> has only one child, a text node.
}
sub list_event {
    my $evdate=shift;
    print trim($evdate->getFirstChild->getData),': ';
    my $event=$evdate->getParentNode;
    print trim($event->getElementsByTagName('name')->item(0)->getFirstChild->getData),"\n";
$event is the Element node for the <event> that the <date> was found in. We search its descendants for a <name> element, not caring whether the <name> came before or after the <date>.
    my $sponsor=$event->getParentNode;
    print 'SPONSOR: ',
        trim($sponsor->getElementsByTagName('name')->item(0)->getFirstChild->getData),"\n";
    for (my $desc=$evdate->getNextSibling;defined $desc;$desc=$desc->getNextSibling) {
        if ($desc->getNodeType==ELEMENT_NODE and $desc->getTagName eq 'description') {
This time we look for a <description> element among the siblings that follow the <date>.
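            my $t=trim($desc->getFirstChild->getData);
            print wrap(' ' x 4,' ' x 4,$t),"\n";
            last;
        }
    }
    print "\n";
}

# Collapse newlines to spaces and strip leading and trailing whitespace.
# Works on a private copy so the caller's $_ is left alone.
sub trim {
    my $s=shift;
    $s=~tr/\n/ /;
    $s=~s/^\s+//;
    $s=~s/\s+$//;
    return $s;
}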
There is, of course, more to the DOM than reading documents. The DOM supplies many methods for creating and modifying documents and subtrees of documents. We'll be covering them next month, when we talk about the many ways of creating XML documents from non-XML sources.
I'd like to start a regular section in this column where readers send in their XML questions or problems. Send your questions to me at ebohlman@netcom.com and be sure to put "XML Clinic" in the Subject: line. I can't guarantee that every question will appear here, nor can I respond personally to every question, but I'll take the most interesting questions and try to answer them.
I'm under contract with Manning Publications to write a book tentatively titled XML Data Manipulation with Perl. It will cover the same sort of material as this column, but in greater depth and with much more substantial code examples. I'm supposed to finish writing it by February, so look for it next spring. I welcome suggestions for application areas to cover and programs to include.