MacPerl and XML

By: Vicki Brown

XML (eXtensible Markup Language) is a popular buzzword these days. So it's not too surprising to see XML modules popping up for Perl. What may be surprising is that these modules also work well with MacPerl, under Mac OS.

This month's guest columnist, Arved Sandström, is no stranger to MacPerl modules. His tutorial on XS [1] appeared in the August 1999 issue of PerlMonth. Arved has been watching Eric Bohlman's column, X Marks (up) the Language [2], and suggested that a companion piece about MacPerl and XML might be interesting. I agreed.

Arved lives in Dartmouth, Nova Scotia, and enjoys scuba, rock-climbing, biking, hiking, ocean-kayaking... get the drift? He first encountered programming in the mid-70s, a la Fortran on a CDC mainframe. Arved has a physics background and likes programming as long as it's not connected to business.

Introductions over, readers, please welcome Arved Sandström.

I'll see you next month.

- Vicki

p.s. If you're a MacPerl user with knowledge to share, I'm looking for other guest columnists for future issues. Just send me a note at macperl@perlmonth.com


Perl started out on UNIX. As a result, numerous Perl functions, and many CPAN modules, simply do not map into the Mac OS domain. It is a tribute to Matthias Neeracher that MacPerl performs as well as it does. But "adapting and improvising", as Clint Eastwood would have it, is often the order of the day.

Most of us have, at one time or another, abandoned our initial hopes of carrying out some task with specific Perl modules: it just couldn't be done with MacPerl. Some of this is due to differences between the operating systems, but until a new and improved MacPerl is released, our community is also slowly being left behind as Perl itself evolves.

The good news is - this is not always the case! It's always pleasant to find that a new and important set of CPAN modules works well on Mac OS. Which brings me to the topic of XML processing with MacPerl. XML stands for the Extensible Markup Language. It is, fundamentally, a means for defining custom markup languages and using those languages to structure text.

XML and Mac OS

It is useful to start with an overview of XML processing with Perl, and for that I direct you to Eric Bohlman's column pieces. It is not my intention to teach XML - there are numerous excellent resources for this purpose. Start with Introducing XML [2] in Issue 3 (July, 1999).

Once you have determined that your solution includes XML and Mac OS, you should be aware of various options before you decide on MacPerl as opposed to some other language choice. I am not going to belabour the point, as this is a Perl column, but the following table should save you some grief:

XML Language Options, under Mac OS
Language     Coverage     Compatibility   Speed      RAD       SAX       XSL
C            Low          Medium          High (1)   Low (1)   No        No
C++          Medium (2)   Medium          High       Low       Yes (2)   Yes (2)
Java         High         High            Low        Medium    Yes       Yes
Perl         Medium       Medium          Medium     High      No (3)    No (3)
Python (4)   High         High            Medium     High      Yes       Yes

Notes: Coverage refers to the degree to which a given language supports XML technologies (using freely-available tools). Compatibility is a rough estimate of how well tools work on Mac OS. Speed means XML processing speed. RAD (Rapid Application Development) gives an idea of how quickly one can expect to get useful applications up and running. SAX (Simple API for XML) and XSL (Extensible Stylesheet Language) are two important XML technologies, and Yes means that you can expect to use them on Mac OS.

  1. The main XML parser written in C is James Clark's expat, which forms the core of Perl (and MacPerl) XML parsing.

  2. Keep an eye on the Apache XML Project [3]. The Xerces-C and Xalan-C processors offer the best bet for C++ development that will be usable (with porting) on Mac OS. Or, check out IBM's xml4c parser.

  3. You can do SAX if you've got a recent (5.005_x) Perl. But, for now, that rules out MacPerl. There is no good XSL solution available for Perl at this time.

  4. MacPython is available from the Python folks [4].

Other factors you should keep in mind are your familiarity with the languages in question, and the XML processing subset that you need. From this point on, I am going to assume that MacPerl was chosen.

Getting Started with MacPerl XML Processing

First and foremost, you need an XML parser. For Perl, this means XML::Parser, which is an XS distribution available from the CPAN (Comprehensive Perl Archive Network). XS files can be tricky to build under Mac OS, so you will want to obtain a pre-built MacPerl binary distribution [5].

I strongly recommend using Chris Nandor's installme and/or untargzipme scripts to unpack and install the distribution [6]. If you unpack it by hand instead, bear in mind that the files use UNIX line endings. Beyond that, follow the Readme instructions.

After you have made the parser work, of course, you may wish to fiddle with it. If you have a C compiler, and wish to build the parser's XS file yourself, consult the XS tutorial [7] for general guidance. XML::Parser is one of the easier CPAN extension modules to build for MacPerl.

XML returned from XML::Parser has had Mac newlines (CR, or 0x0D) converted into UNIX newlines (LF, or 0x0A), as required by the XML Recommendation (see 2.11, End of Line Handling). "\n" in Perl is an OS-specific newline – it does not mean LF or 0x0A – so trying to match newlines in processed XML with "\n" in a MacPerl script will get you nowhere.
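
A minimal sketch of the workaround, assuming a hypothetical string $xml_text that stands in for character data handed back by the parser: match the line feed by its octal code, \012, and translate to the native newline only when preparing output.

```perl
#!perl -w
use strict;

# Hypothetical sample of parser output: line ends are always LF (0x0A),
# whatever the platform's own newline happens to be.
my $xml_text = "first line\012second line\012third line";

# Split on an explicit LF rather than on "\n", which is CR under MacPerl.
my @lines = split /\012/, $xml_text;

# Translate LF to the native newline only at output time.
(my $native = $xml_text) =~ s/\012/\n/g;

print scalar(@lines), " lines found\n";
```

On UNIX the substitution is a no-op, so a script written this way should behave the same on both platforms.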

Perl XML modules which depend on XML::Parser are also compatible. I recommend increasing the memory assigned to MacPerl to 20 or 30 MB before doing any serious work with modules such as XML::DOM or XML::XQL. A full, up-to-date list of Perl XML modules [8] is available at www.perlxml.com. Also refer to the Perl-XML FAQ [8], maintained by Jonathan Eisenzopf, at the same site.

As a quick reference, I reviewed the following modules in early December, 1999 (explanations of each are found in the modules list referred to previously), testing with XML-Parser-2.27:

Other modules which are Mac OS-friendly include XML::Writer and XML::Dumper. I have also tested XML modules related to CGI, and these work fine.

Production of the Macintosh Encoding for XML::Parser

XML::Parser converts all encodings to UTF-8 internally and returns UTF-8. It uses encoding maps with suffix .enc stored in the XML:Parser:Encodings folder in your :site_perl folder. The format of these maps, which are binary, is described in the file encoding.h that comes with the XML::Encoding module.

To process XML that uses the default encoding for script code 0 (smRoman), which is what you will be using in North America on a Mac, download Apple's ROMAN.TXT mapping file from the Unicode FTP site [9].

The XML::Encoding module includes two Perl scripts for preparing XML::Parser encoding maps. The first, make_encmap, is designed to take the Unicode mapping files and convert them to an XML representation. Here is the command line (which you might use under MPW (Macintosh Programmer's Workshop) [10]):

perl make_encmap macintosh ROMAN.TXT > macintosh.map

The other script, compile_encoding, needs some modification. It uses pseudohashes (hence the 'use fields' pragma), which are not available in Perl 5.004 (or, consequently, in MacPerl). Edit the script to completely remove the Pfxmap package. Further down in the file, where you see something like

$pfxmap = new Pfxmap(min => 255, max => 0, map => []);

change it to

$pfxmap = {min => 255, max => 0, map => []};
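
The reason this one-line change is enough is that pseudohash fields are read and written with ordinary hash-element syntax, so code that uses $pfxmap->{min} and friends should not need to change. A small sketch of the idea; the loop and the byte values are made up for illustration and are not taken from compile_encoding:

```perl
#!perl -w
use strict;

# A plain anonymous hash standing in for the former pseudohash object.
my $pfxmap = { min => 255, max => 0, map => [] };

# Hypothetical byte values, used only to exercise the fields.
for my $byte (0x41, 0x80, 0xC3) {
    $pfxmap->{min} = $byte if $byte < $pfxmap->{min};
    $pfxmap->{max} = $byte if $byte > $pfxmap->{max};
    $pfxmap->{map}[$byte] = 1;      # mark this byte as seen
}

printf "min=0x%02X max=0x%02X\n", $pfxmap->{min}, $pfxmap->{max};
```

What is lost is the compile-time checking of field names that 'use fields' provides, so a misspelled key will silently create a new entry.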

Before running compile_encoding, edit the XML map (macintosh.map, in this case), so that the start tag reads

<encmap name='macintosh' expat='yes'>

make_encmap leaves out the expat attribute, which seems to cause problems when running compile_encoding. If you are going to do a bunch of encodings, modify make_encmap so that it adds the expat attribute to the encmap tag.
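
If you would rather patch the generated maps than modify make_encmap, something along these lines could do it. This is a sketch only: add_expat_attr is a hypothetical helper, and it assumes the start tag looks like the one shown above, without the expat attribute.

```perl
#!perl -w
use strict;

# Add expat='yes' to the encmap start tag unless an expat attribute
# is already present. The rest of the map is left untouched.
sub add_expat_attr {
    my ($xml) = @_;
    $xml =~ s/<encmap\b(?![^>]*\bexpat\s*=)([^>]*)>/<encmap$1 expat='yes'>/;
    return $xml;
}

# Hypothetical one-line sample; a real map has many lines after the start tag.
my $patched = add_expat_attr("<encmap name='macintosh'>");
print "$patched\n";
```

To apply it to a file, read the whole map into a string, pass it through the helper, and write the result back out before running compile_encoding.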

Run compile_encoding:

perl compile_encoding -o macintosh.enc macintosh.map

Put the compiled map (macintosh.enc) into the Encodings folder. You will now be able to set the encoding declaration to "macintosh", and smRoman text will be properly converted to UTF-8. Alternatively, use the ProtocolEncoding option to XML::Parser->new() if that is more appropriate.

Converting from UTF-8 back to Macintosh Standard Roman

Let's say that we've used XML::Parser, XML::DOM and XML::XQL to run some queries on a batch of XML files. The original text was prepared with SimpleText and we used some high-ASCII (a bit of a misnomer; characters between 128 and 255 inclusive) in the course of doing so. The output text, of course, is in UTF-8. How do we convert the characters back to Macintosh smRoman script?

The Unicode::Map8 and Unicode::String extension modules, also available from the MMP page [5], save us from re-inventing the wheel (though, having done so, I can assure you that writing Perl code to handle UTF-8 is neither particularly edifying, nor does it result in blazingly fast code). The script below strips out the XML tags from an XML file, and converts the UTF-8 back to MacRoman.

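#!perl -w
# Strip the tags from test1.xml and write its text to test1.txt as MacRoman.

use strict;
use XML::Parser 2.27;
use Unicode::Map8;
use Unicode::String qw(utf8 utf16);

my $us   = Unicode::String->new();          # scratch Unicode string
my $map  = Unicode::Map8->new("MacRoman")   # 8-bit <-> Unicode mapping
    or die "Cannot load the MacRoman map";
my $text = "";                              # accumulated MacRoman text

my $parser = XML::Parser->new();
$parser->setHandlers(Char => \&text);       # called for each run of character data
$parser->parsefile('test1.xml');

# The parser returns LF line ends; convert them to the native newline.
$text =~ s/\012/\n/g;
open(OUT, ">test1.txt") or die "Cannot open test1.txt: $!";
print OUT $text;
close OUT;

# Char handler: convert a chunk of UTF-8 character data to MacRoman.
sub text {
    my ($expat, $string) = @_;
    $us->utf8($string);                     # load the UTF-8 bytes
    $text .= $map->to8($us->utf16());       # map UTF-16 to 8-bit MacRoman
}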

The mappings which come with Unicode::Map8 are very complete, and other Macintosh character sets (MacGreek, MacCyrillic, MacLatin2, MacIcelandic, and MacTurkish) are also supported. Refer to the manpage for each module to see what it can do. The umap script in the distribution can be used to list all the maps; modify line 246, replacing the '/' with ':', then run perl umap --list.
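
For the curious, here is roughly what the hand-rolled alternative involves. This is a sketch only, not a replacement for the modules: utf8_to_macroman is a hypothetical helper, its table holds just three of the MacRoman high-half characters, sequences longer than three bytes are rejected, and truncated input is not checked.

```perl
#!perl -w
use strict;

# Partial, illustrative table: Unicode code point => MacRoman byte.
my %to_macroman = (
    0x00C4 => 0x80,   # A with diaeresis
    0x00E9 => 0x8E,   # e with acute
    0x00E8 => 0x8F,   # e with grave
);

sub utf8_to_macroman {
    my ($bytes) = @_;
    my @in = unpack("C*", $bytes);           # work on raw byte values
    my @out;
    while (@in) {
        my $lead = shift @in;
        my $cp;
        if ($lead < 0x80) {                  # 1-byte sequence (ASCII)
            $cp = $lead;
        } elsif (($lead & 0xE0) == 0xC0) {   # 2-byte sequence
            $cp = (($lead & 0x1F) << 6) | (shift(@in) & 0x3F);
        } elsif (($lead & 0xF0) == 0xE0) {   # 3-byte sequence
            my $b2 = shift @in;
            my $b3 = shift @in;
            $cp = (($lead & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F);
        } else {
            die sprintf("unsupported UTF-8 lead byte 0x%02X\n", $lead);
        }
        if    ($cp < 0x80)                { push @out, $cp }
        elsif (exists $to_macroman{$cp})  { push @out, $to_macroman{$cp} }
        else                              { push @out, ord('?') }   # no mapping known
    }
    return pack("C*", @out);
}

# "caf" followed by the two-byte UTF-8 form of e with acute (U+00E9).
my $result = utf8_to_macroman("caf\xC3\xA9");
print join(" ", map { sprintf "%02X", $_ } unpack("C*", $result)), "\n";
```

Every character costs several Perl operations this way, which is why the compiled modules are the better choice for real work.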

Conclusion

Perl-based XML processing on Mac OS is fundamentally the same as on Windows or UNIX. There are limitations, but workarounds usually can be found. For some applications, however, the MacPerl aficionado will have to accept that the solution is more important than the implementation, and use Java (MRJ) or MacPython.


References

[1] Building MacPerl Extensions, Issue 4 of PerlMonth, August, 1999

[2] X Marks (up) the Language was introduced in Issue 3 of PerlMonth (July, 1999). The first column was Introducing XML.

[3] The Apache XML Project can be found at xml.apache.org.

[4] MacPython is available from the Python FTP site.

[5] The pre-built MacPerl modules are available from the MacPerl Module Porters' (MMP) Page.

[6] Get Chris Nandor's cpan-mac scripts on CPAN.

[7] The XS tutorial can be found in the August, 1999 issue of PerlMonth. It is also posted in the Tutorials area of The MacPerl Pages.

[8] See perl-xml-modules.html for the current list of XML modules. See perl-xml-faq.html for the FAQ.

[9] Download ROMAN.TXT from ftp://ftp.unicode.org/public/MAPPINGS/VENDORS/APPLE/.

[10] MPW is available from Apple's MPW ToolZone.