XML (eXtensible Markup Language) is a popular buzzword these days. So it's not too surprising to see XML modules popping up for Perl. What may be surprising is that these modules also work well with MacPerl, under Mac OS.
This month's guest columnist, Arved Sandström, is no stranger to MacPerl modules. His tutorial on XS [1] appeared in the August 1999 issue of PerlMonth. Arved has been watching Eric Bohlman's column, X Marks (up) the Language [2], and suggested that a companion piece about MacPerl and XML might be interesting. I agreed.
Arved lives in Dartmouth, Nova Scotia, and enjoys scuba, rock-climbing, biking, hiking, ocean-kayaking...get the drift? He first encountered programming in the mid-70s, à la Fortran on a CDC mainframe. Arved has a physics background and likes programming as long as it's not connected to business.
Introductions over, readers, please welcome Arved Sandström.
I'll see you next month.
- Vicki
p.s. If you're a MacPerl user with knowledge to share, I'm looking for other guest columnists for future issues. Just send me a note at macperl@perlmonth.com
Perl started out on UNIX. As a result, numerous Perl functions, and many CPAN modules, simply do not map into the Mac OS domain. It is a tribute to Matthias Neeracher that MacPerl performs as well as it does. But "adapting and improvising", as Clint Eastwood would have it, is often the order of the day.
Most of us have abandoned our initial hopes for carrying out some task using specific Perl modules - it just couldn't be done with MacPerl. Some of this is due to the differences in the operating systems, but until a new and improved MacPerl is released, our community is being slowly left behind as Perl itself evolves.
The good news is - this is not always the case! It's always pleasant to find that a new and important set of CPAN modules works well on Mac OS. Which brings me to the topic of XML processing with MacPerl. XML stands for the Extensible Markup Language. It is, fundamentally, a means for defining custom markup languages and using those languages to structure text.
It is useful to start with an overview of XML processing with Perl, and for that I direct you to Eric Bohlman's column pieces. It is not my intention to teach XML - there are numerous excellent resources for this purpose. Start with Introducing XML [2] in Issue 3 (July, 1999).
Once you have determined that your solution includes XML and Mac OS, you should be aware of various options before you decide on MacPerl as opposed to some other language choice. I am not going to belabour the point, as this is a Perl column, but the following table should save you some grief:
| Language | Coverage | Compatibility | Speed | RAD | SAX | XSL |
|---|---|---|---|---|---|---|
| C | Low | Medium | High (1) | Low (1) | No | No |
| C++ | Medium (2) | Medium | High | Low | Yes (2) | Yes (2) |
| Java | High | High | Low | Medium | Yes | Yes |
| Perl | Medium | Medium | Medium | High | No (3) | No (3) |
| Python (4) | High | High | Medium | High | Yes | Yes |
Notes: Coverage refers to the degree to which a given language supports XML technologies (using freely-available tools). Compatibility is a rough estimate of how well tools work on Mac OS. Speed means XML processing speed. RAD (Rapid Application Development) gives an idea of how quickly one can expect to get useful applications up and running. SAX (Simple API for XML) and XSL (Extensible Stylesheet Language) are two important XML technologies, and Yes means that you can expect to use them on Mac OS.
Other factors you should keep in mind are your familiarity with the languages in question, and the XML processing subset that you need. From this point on, I am going to assume that MacPerl was chosen.
First and foremost, you need an XML parser. For Perl, this means XML::Parser, which is an XS distribution available from the CPAN (Comprehensive Perl Archive Network). XS files can be tricky to build under Mac OS, so you will want to obtain a pre-built MacPerl binary distribution [5].
I strongly recommend using Chris Nandor's installme and/or untargzipme scripts to unpack and install the distribution [6]. If you unpack it by hand instead, bear in mind that the files use UNIX line endings. Either way, follow the Readme instructions.
After you have made the parser work, of course, you may wish to fiddle with it. If you have a C compiler, and wish to build the parser's XS file yourself, consult the XS tutorial [7] for general guidance. XML::Parser is one of the easier CPAN extension modules to build for MacPerl.
XML returned from XML::Parser has had Mac newlines (CR, or 0x0D) converted into UNIX newlines (LF, or 0x0A), as required by the XML Recommendation (see 2.11, End of Line Handling). "\n" in Perl is an OS-specific newline; it does not mean LF or 0x0A. Under MacPerl it is CR, so trying to match newlines in processed XML with "\n" in a MacPerl script will get you nowhere. Match LF explicitly, as "\012", instead.
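To make the newline point concrete, here is a minimal sketch of handling parser output portably. The helper names split_xml_lines and to_native_newlines are made up for this illustration and are not part of XML::Parser; the only assumption is that the text has already come back from the parser with LF line ends.

```perl
use strict;

# XML::Parser hands back LF (0x0A) line ends on every platform, so name
# the character by its octal code instead of relying on "\n".
# (Hypothetical helper, for illustration only.)
sub split_xml_lines {
    my ($text) = @_;
    return split /\012/, $text, -1;   # -1 keeps trailing empty fields
}

# Convert LF line ends to whatever "\n" means on the current platform,
# ready for writing to a native text file. (Hypothetical helper.)
sub to_native_newlines {
    my ($text) = @_;
    $text =~ s/\012/\n/g;
    return $text;
}

my @lines = split_xml_lines("first\012second\012third");
print scalar(@lines), " lines\n";     # prints "3 lines"
```

Because neither helper depends on what "\n" happens to be, the same code should behave identically under MacPerl and on UNIX.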
XML Perl modules which depend on XML::Parser are also compatible. I recommend increasing the memory assigned to MacPerl to 20 or 30 MB before doing any serious work with modules such as XML::DOM or XML::XQL. A full, updated list of Perl XML modules [8] is available at www.perlxml.com. Also refer to the Perl-XML FAQ [8], maintained by Jonathan Eisenzopf, at the same site.
As a quick reference, I reviewed the following modules in early December, 1999 (explanations of each are found in the modules list referred to previously), testing with XML-Parser-2.27:
Other modules which are Mac OS-friendly include XML::Writer and XML::Dumper. I have also tested XML modules related to CGI, and these work fine.
XML::Parser converts all encodings to UTF-8 internally and returns UTF-8. It uses encoding maps with suffix .enc stored in the XML:Parser:Encodings folder in your :site_perl folder. The format of these maps, which are binary, is described in the file encoding.h that comes with the XML::Encoding module.
To process XML that uses the default encoding for script code 0, smRoman, which is what you will be using in North America on a Mac, download the file ROMAN.TXT from Apple's FTP site [9].
The XML::Encoding module includes two Perl scripts for preparing XML::Parser encoding maps. The first, make_encmap, is designed to take the Unicode mapping files and convert them to an XML representation. Here is the command line (which you might use under MPW (Macintosh Programmer's Workshop) [10]):
perl make_encmap macintosh ROMAN.TXT > macintosh.map
The other script, compile_encoding, needs some modification. It uses pseudohashes (hence the 'use fields' pragma), which are not available as part of perl 5.004 (or, consequently, MacPerl). Edit the script to completely remove the PfxMap package. Further down in the file, where you see something like
$pfxmap = new Pfxmap(min => 255, max => 0, map => []);
change it to
$pfxmap = {min => 255, max => 0, map => []};
Before running compile_encoding, edit the XML map (macintosh.map, in this case), so that the start tag reads
<encmap name='macintosh' expat='yes'>
make_encmap leaves out the expat attribute, which seems to cause problems when running compile_encoding. If you are going to do a bunch of encodings, modify make_encmap so that it adds the expat attribute to the encmap tag.
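If you would rather not edit make_encmap itself, the attribute can be patched into each generated map afterwards. The following is only a sketch: add_expat_attr is a made-up helper name, and it assumes the start tag appears in the map as an ordinary <encmap ...> tag, as in the example above.

```perl
use strict;

# Add expat='yes' to the <encmap ...> start tag unless the attribute is
# already present. Hypothetical helper; assumes a plain <encmap ...> tag.
sub add_expat_attr {
    my ($xml) = @_;
    # The negative lookahead skips a start tag that already has expat=.
    $xml =~ s/<encmap\b(?![^>]*\bexpat\s*=)([^>]*)>/<encmap$1 expat='yes'>/;
    return $xml;
}

print add_expat_attr("<encmap name='macintosh'>"), "\n";
# prints: <encmap name='macintosh' expat='yes'>
```

To use it on a real map, read macintosh.map into a string, pass it through add_expat_attr, and write the result back out before running compile_encoding.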
Run compile_encoding:
perl compile_encoding -o macintosh.enc macintosh.map
Put the map into the Encodings folder. You will now be able to set the encoding declaration to "macintosh", and smRoman0 text will be properly converted to UTF-8. Alternatively, use the ProtocolEncoding option to XML::Parser->new() if that is more appropriate.
Let's say that we've used XML::Parser, XML::DOM and XML::XQL to run some queries on a batch of XML files. The original text was prepared with SimpleText and we used some high-ASCII (a bit of a misnomer; characters between 128 and 255 inclusive) in the course of doing so. The output text, of course, is in UTF-8. How do we convert the characters back to Macintosh smRoman script?
The Unicode::Map8 and Unicode::String extension modules, also available from the MMP page [5], save us from re-inventing the wheel (though, having done so, I can assure you that writing Perl code to handle UTF-8 is neither particularly edifying, nor does it result in blazing code). The script below strips out the XML tags from an XML file, and converts the UTF-8 back to MacRoman.
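#!perl -w
use XML::Parser 2.27;
use Unicode::Map8;
use Unicode::String qw(utf8 utf16);

# Reusable Unicode string object and the MacRoman mapping table.
$us = Unicode::String->new();
$map = Unicode::Map8->new("MacRoman");
$text = "";

# Only character data gets a handler, so the tags are dropped.
$parser = new XML::Parser();
$parser->setHandlers(Char => \&text);
$parser->parsefile('test1.xml');

# The parser returns LF line ends; convert them to the native newline.
$text =~ s/\012/\n/g;

open(OUT, ">test1.txt") or die "Cannot open test1.txt: $!";
print OUT $text;
close OUT;

# Char handler: convert each UTF-8 chunk to MacRoman and append it.
sub text {
    my ($expat, $string) = @_;
    $us->utf8($string);
    $text .= $map->to8($us->utf16());
}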
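The script above leaves the UTF-8 decoding to Unicode::String. For readers curious about what the hand-written alternative involves, here is a minimal sketch of decoding UTF-8 bytes into code points in plain Perl. The function name utf8_to_codepoints is made up for this illustration; the sketch handles only 1-, 2- and 3-byte sequences, does not reject overlong forms, and is not the code I originally wrote.

```perl
use strict;

# Decode a UTF-8 byte string into a list of Unicode code points.
# Sketch only: 1-, 2- and 3-byte sequences, no overlong-form checks.
sub utf8_to_codepoints {
    my ($bytes) = @_;
    my @b = unpack('C*', $bytes);
    my @cp;
    while (@b) {
        my $lead = shift @b;
        my ($value, $more);
        # The lead byte gives the payload bits and the number of
        # continuation bytes that follow.
        if    ($lead < 0x80)           { ($value, $more) = ($lead, 0); }
        elsif (($lead & 0xE0) == 0xC0) { ($value, $more) = ($lead & 0x1F, 1); }
        elsif (($lead & 0xF0) == 0xE0) { ($value, $more) = ($lead & 0x0F, 2); }
        else { die sprintf("Unsupported lead byte 0x%02X\n", $lead); }
        die "Truncated UTF-8 sequence\n" if @b < $more;
        for (1 .. $more) {
            my $next = shift @b;
            die sprintf("Bad continuation byte 0x%02X\n", $next)
                unless ($next & 0xC0) == 0x80;
            # Each continuation byte contributes six more bits.
            $value = ($value << 6) | ($next & 0x3F);
        }
        push @cp, $value;
    }
    return @cp;
}

# The bytes C3 A9 are the UTF-8 form of U+00E9 (e with acute accent).
foreach my $cp (utf8_to_codepoints("\xC3\xA9")) {
    printf "U+%04X\n", $cp;           # prints "U+00E9"
}
```

A complete converter would still need a table from code points to MacRoman bytes, which is exactly what Unicode::Map8 supplies.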
The mappings which come with Unicode::Map8 are very complete, and other Macintosh character sets (MacGreek, MacCyrillic, MacLatin2, MacIcelandic, and MacTurkish) are also supported. Refer to the manpage for each module to see what it can do. The umap script in the distribution can be used to list all the maps; modify line 246, replacing the '/' with ':', then run perl umap --list.
Perl-based XML processing on Mac OS is fundamentally the same as on Windows or UNIX. There are limitations, but workarounds usually can be found. For some applications, however, the MacPerl aficionado will have to accept that the solution is more important than the implementation, and use Java (MRJ) or MacPython.
[1] Building MacPerl Extensions, Issue 4 of PerlMonth, August, 1999
[2] X Marks (up) the Language was introduced in Issue 3 of PerlMonth (July, 1999). The first column was Introducing XML.
[3] The Apache XML Project can be found at xml.apache.org.
[4] This is available from the Python FTP site.
[5] This is available from the MacPerl Module Porters' Page.
[6] Get Chris Nandor's cpan-mac scripts on CPAN.
[7] The XS tutorial can be found in the August, 1999 issue of PerlMonth. It is also posted in the Tutorials area of The MacPerl Pages.
[8] See perl-xml-modules.html for the current list of XML modules. See perl-xml-faq.html for the FAQ.
[9] Download ROMAN.TXT from ftp://ftp.unicode.org/public/MAPPINGS/VENDORS/APPLE/.
[10] MPW is available from Apple's MPW ToolZone.