Introducing XML

By: Eric Bohlman

Languages for Data

When we talk about a "computer language," we're usually thinking of a machine-readable language (like Perl) for describing our programs. But it's often the case that our data are complex enough that we really need a machine-readable language to describe them as well. In their classic-the-moment-it-was-released book The Practice of Programming, Brian Kernighan and Rob Pike devote a whole chapter to the importance of having good notation for one's data.

Some "languages" for data are exceedingly simple; most of us have spent plenty of time working with comma-separated lists of data items, one record per line. Some are intermediate in complexity, such as the language used in makefiles for building software, where you have lines with no leading whitespace to specify the program to be built and the files it depends on, followed by tab-indented lines specifying the commands to invoke to build the program. Some are quite complex, like the language used to configure the Unix sendmail program.

One of the big problems with developing ad-hoc data languages is that programs often have to exchange data with each other, and it does no good if each programmer has a different idea of what the data should look like. It's fun but wasteful to spend time creating and debugging parsers for other people's proprietary data formats. The advent of the WWW has made these problems even worse, since many applications now need to pull in data from some other application halfway across the world.

SGML

Starting about thirty years ago, the technical-publishing industry began trying to come up with a common language to describe a very specific type of data: technical documents. It soon became apparent that a single language that could describe any technical document would be a great big unwieldy mess, so the goal was changed to defining a meta-language that could be used to define specific languages for "marking up" technical documents.

The culmination of this effort was SGML (Standard Generalized Markup Language), which became an international standard in 1986. SGML's name is somewhat misleading because it isn't a markup language itself; it's a meta-language for defining markup languages that can be used to describe families of documents. Each such markup language is called an "application" of SGML, which can be slightly confusing to people accustomed to thinking of "application" as meaning a program or set of programs.

Most of the markup languages created using SGML were little-known outside the publishing industry, military procurement offices, and academic departments doing research on historical texts. But in the early 1990s, one SGML application burst on the scene and quickly became a matter of common knowledge: HTML, the HyperText Markup Language. All SGML applications resemble HTML in that a marked-up document consists of a single element which can in turn contain nested elements, each of which can contain other elements or plain text. The beginning and end of each element are marked by start tags and end tags enclosed in angle brackets.

Well, that's a rather oversimplified description of how SGML documents are marked up. Because the first SGML systems needed to deal with legacy documents that had to be hand-converted, SGML allows an awful lot of syntactic options and has many optional features. Unfortunately, these make an SGML parser (a program or module that reads an SGML document and breaks it up into its components) hard to write and large. An SGML parser also can't properly parse a document unless it has a formal grammar, known as a DTD, for the particular SGML application the document is written in. It can't figure out the structure of a document all by itself.

Enter XML

The experience with SGML showed that something similar would be an ideal way to describe structured data that had to be passed between programs. It also showed that "similar" really had to mean "simpler." In 1996, the World Wide Web Consortium (W3C) began an effort to define a subset of SGML that was simple enough to use for data interchange, particularly over the Web. The name XML, for eXtensible Markup Language, was chosen for the project (missing a chance to fix the misleading aspect of SGML's name). In February 1998 the W3C adopted the recommendation of its XML Working Group.

Unlike the standard defining SGML, which is several hundred printed pages long, the XML Recommendation is only about 35 printed pages long. It can be found at http://www.w3.org/TR/REC-xml; an annotated version with comments by Tim Bray, one of the co-editors of the recommendation, can be found on the XML.COM Web site.

The Mechanics of XML: a Simplified Overview

Every XML document consists of an optional prolog followed by a single element, known as the root element. An element begins with a start tag, which consists of a left angle-bracket (<), the element-type name of the element, an optional list of attribute-value pairs, and a closing right angle-bracket (>). It ends with an end tag, which is like the start tag but has a slash immediately after the left angle-bracket and has no attribute list. The one exception is an empty element, that is, an element that has no text or other elements as children: it can be written with no end tag and a special form of the start tag, in which a slash comes immediately before the closing right angle-bracket.
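The two ways of writing an empty element can be checked with any XML parser. What follows is only a minimal sketch in Python, using the standard library's xml.etree.ElementTree module; the language is an illustrative choice, and the element name separator and the helper describe are made up for the example.

```python
import xml.etree.ElementTree as ET

def describe(xml_text):
    """Parse xml_text and return (element-type name, text content, number of child elements)."""
    root = ET.fromstring(xml_text)
    return (root.tag, root.text, len(root))

# An empty element written with a start tag followed by an end tag...
long_form = describe("<separator></separator>")
# ...and the same element written with the special empty-element form.
short_form = describe("<separator/>")

print(long_form)
print(short_form)
print(long_form == short_form)  # the parser reports the same thing for both
```

As far as the application is concerned, the two spellings are the same element: no text and no children.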

A very simple XML document

<document>Hello, World!</document>

In this example, our document consists of a single root element, namely document, with only text as its content. It has no attributes. Let's go ahead and give it one:

<document type="example">Hello, World!</document>

Now our document element has a single attribute, type, whose value is example. In XML, all attribute values must be enclosed in double or single quotes, and all attributes must have values. Constructs like

<document example>Hello, World!</document>

which are legal in full SGML applications like HTML, are not allowed in XML. This was one of the compromises that the XML Working Group made in order to make XML extremely simple to parse.
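To see these attribute rules enforced, one can hand a few variants of the example to a parser and note which ones it accepts. This is a rough sketch in Python with the standard library's xml.etree.ElementTree module (an illustrative choice); the helper is_well_formed and the variable names are invented for the example, and the unquoted variant is an extra case added alongside the ones in the text.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it reports a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

double_quoted = '<document type="example">Hello, World!</document>'
single_quoted = "<document type='example'>Hello, World!</document>"
unquoted = '<document type=example>Hello, World!</document>'
valueless = '<document example>Hello, World!</document>'

# Only the two quoted forms should be accepted.
for text in (double_quoted, single_quoted, unquoted, valueless):
    print(is_well_formed(text), text)

# For an accepted document, the attributes come back as name/value pairs.
print(ET.fromstring(double_quoted).attrib)
```

The parser rejects the unquoted and valueless forms outright instead of trying to guess what was meant, which is part of what keeps XML parsers small.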

Elements inside elements inside...

In the examples so far, our documents have had only one element, with only text (more formally called characters) as its content. But XML would be an inadequate language for describing data if it didn't provide structure. In addition to text, elements can hold other elements. Let's take a look at a short document with nested elements:

<pods>
  <group name="faqs">
    <doc>perlfaq1.pod</doc>
    <doc>perlfaq2.pod</doc>
  </group>
  <group name="object-tutorials">
    <doc>perltoot.pod</doc>
  </group>
</pods>

Here our root-level pods element contains two group elements, each of which contains one or more doc elements, each of which contains characters. Actually, all of the elements in this example contain characters, because whitespace is always significant in XML! Those line breaks and indentations are whitespace, which the parser will pass on to the application.
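Both points, the nesting and the whitespace, show up as soon as the document is run through a parser. The sketch below uses Python's standard xml.etree.ElementTree module purely for illustration; the variable names are assumptions of the example, and the document text is the one shown above.

```python
import xml.etree.ElementTree as ET

PODS = """<pods>
  <group name="faqs">
    <doc>perlfaq1.pod</doc>
    <doc>perlfaq2.pod</doc>
  </group>
  <group name="object-tutorials">
    <doc>perltoot.pod</doc>
  </group>
</pods>"""

root = ET.fromstring(PODS)

# Walk the hierarchy: each group element and the doc elements inside it.
for group in root.findall("group"):
    docs = [doc.text for doc in group.findall("doc")]
    print(group.get("name"), docs)

# The whitespace between tags is delivered as well: root.text holds the
# characters between the <pods> start tag and the first <group> start tag.
print(repr(root.text))
```

The last line prints a string made up of a newline and two spaces, which is exactly the indentation in front of the first group element. An application that doesn't care about such whitespace has to discard it itself.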

As you've probably guessed, the data this example document is describing is part of the hierarchy of the documentation that comes with Perl. In future columns, we'll be building a "super perldoc" that will use XML to describe the documentation hierarchy.

To learn more

This article can't give a complete tutorial on all the rules for writing XML documents. In my experience, the XML.COM Web site is the best place to start looking for XML documentation and tutorials.

What XML Isn't

There are a lot of misconceptions floating around regarding what XML is and why it was designed. XML is not:

There's an awful lot of marketing and journalistic hype about XML floating around. Some of it makes extravagant claims for what XML can do; some of it sets up straw-man arguments about XML and demolishes them in an attempt to claim that XML is just a fad. This column will stick to facts, not hype.

Where we're going

This pretty much wraps up our first installment in "X Marks (up) the Language." In our next column, we'll begin exploring the available Perl tools for dealing with XML, starting with the parsers. In the meantime, you should take a look at the Perl-XML Web site, where Jonathan Eisenzopf has put together a set of FAQs on XML and the Perl tools for it, as well as a comprehensive listing of all the available Perl modules dealing with XML.