Perl SAX 2.0 Binding

SAX (Simple API for XML) is a common interface for XML parsers. It lets application writers use XML parsers while remaining independent of which parser is actually used.

This document describes the version of SAX used by Perl modules. The original version of SAX 2.0, for Java, is described at http://sax.sourceforge.net/.

The Perl version of SAX has two basic interfaces: the parser interface and the handler interface. The parser interface creates new parser instances, starts parsing, and provides additional information to handlers on request. The handler interface receives parse events from the parser. This pattern is also commonly called "Producer and Consumer" or "Generator and Sink". Note that the parser doesn't have to be an XML parser; all it needs to do is provide a stream of events to the handler as if it were parsing XML. The actual data from which the events are generated can be anything: a Perl object, a CSV file, a database table, and so on.

SAX is typically used like this:

    my $handler = MyHandler->new();
    my $parser = AnySAXParser->new( Handler => $handler );
    $parser->parse($uri);

Handlers are typically written like this:

    package MyHandler;

    sub new {
        my $type = shift;
        return bless {}, $type;
    }

    sub start_element {
        my ($self, $element) = @_;

        print "Starting element $element->{Name}\n";
    }

    sub end_element {
        my ($self, $element) = @_;

        print "Ending element $element->{Name}\n";
    }

    sub characters {
        my ($self, $characters) = @_;

        print "characters: $characters->{Data}\n";
    }

    1;
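
Because a "parser" only has to deliver events, a generator can be driven by ordinary Perl data. The following is a minimal sketch under that assumption, not part of any SAX distribution: the RecordGenerator and EventLog packages, the element names, and the sample record are all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical event generator: reports an array of hashes to a handler as if
# it were parsing <records><record><field>...</field></record></records>.
package RecordGenerator;

sub new {
    my ($class, %args) = @_;
    return bless { Handler => $args{Handler} }, $class;
}

# Emit one complete element that contains only character data.
sub _text_element {
    my ($self, $name, $text) = @_;
    my $h = $self->{Handler};
    $h->start_element({ Name => $name, Attributes => {} });
    $h->characters({ Data => $text });
    $h->end_element({ Name => $name });
}

sub parse {
    my ($self, $records) = @_;
    my $h = $self->{Handler};
    $h->start_document({});
    $h->start_element({ Name => 'records', Attributes => {} });
    for my $record (@$records) {
        $h->start_element({ Name => 'record', Attributes => {} });
        $self->_text_element($_, $record->{$_}) for sort keys %$record;
        $h->end_element({ Name => 'record' });
    }
    $h->end_element({ Name => 'records' });
    # Like parse(), return whatever end_document() returns.
    return $h->end_document({});
}

# Minimal handler that records the events it receives, in order.
package EventLog;

sub new            { return bless { events => [] }, shift }
sub start_document { push @{ $_[0]{events} }, 'start_document' }
sub start_element  { push @{ $_[0]{events} }, "start:$_[1]{Name}" }
sub end_element    { push @{ $_[0]{events} }, "end:$_[1]{Name}" }
sub characters     { push @{ $_[0]{events} }, "text:$_[1]{Data}" }
sub end_document   { return $_[0]{events} }

package main;

my $log    = EventLog->new;
my $events = RecordGenerator->new(Handler => $log)
    ->parse([ { id => 1, name => 'Ann' } ]);
print "$_\n" for @$events;
```

The handler cannot tell that no XML was involved; it sees the same calls, in the same order, that an XML parser would make for an equivalent document.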

Basic SAX Parser

These methods and options are the most commonly used with SAX parsers and event generators.

Applications must not invoke a parse() method again while a parse is in progress (they should instead create a new SAX parser for each nested XML document). Once a parse is complete, an application may reuse the same parser object, possibly with a different input source.

During the parse, the parser provides information about the XML document through the registered event handlers. Note that an event that hasn't been registered (i.e. one with no corresponding method in the handler's class) will not be delivered. This lets a handler receive only the events it is interested in.
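
One way a generator could honor this rule is to check for the method before calling it. The sketch below assumes that approach; send_event() and the StartOnly package are hypothetical names, and real parsers may dispatch differently.

```perl
use strict;
use warnings;

# Handler that only cares about element starts; it defines no other callbacks.
package StartOnly;

sub new           { return bless { seen => [] }, shift }
sub start_element { push @{ $_[0]{seen} }, $_[1]{Name} }

package main;

# Hypothetical helper: call the handler method only if the handler's
# class actually provides it, and silently skip the event otherwise.
sub send_event {
    my ($handler, $method, $data) = @_;
    my $code = $handler->can($method) or return;
    return $handler->$code($data);
}

my $handler = StartOnly->new;
send_event($handler, 'start_document', {});
send_event($handler, 'start_element',  { Name => 'doc', Attributes => {} });
send_event($handler, 'characters',     { Data => 'hello' });
send_event($handler, 'end_element',    { Name => 'doc' });
send_event($handler, 'end_document',   {});

print "seen: @{ $handler->{seen} }\n";
```

Only start_element() runs here; the other four events are dropped without error because StartOnly has no methods for them.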

parse(uri [, options])
Parses the XML instance identified by uri (a system identifier). options can be a list of option, value pairs or a hash. Options include Handler, features and properties, and advanced SAX parser options. parse() returns the result of calling the end_document() handler. The options supported by parse() may vary slightly if what is being "parsed" isn't XML.

parse_file(stream [, options])
Parses the XML instance in the already opened stream, an IO::Handle or similar. options are the same as for parse(). parse_file() returns the result of calling the end_document() handler.

parse_string(string [, options])
Parses the XML instance in string. options are the same as for parse(). parse_string() returns the result of calling the end_document() handler.

Handler
The default handler object to receive all events from the parser. Applications may change Handler in the middle of the parse and the SAX parser will begin using the new handler immediately. The Advanced SAX document lists a number of more specialized handlers that can be used should you wish to dispatch different types of events to different objects.

Basic SAX Handler

These methods are the most commonly used by SAX handlers.

start_document(document)
Receive notification of the beginning of a document.

The SAX parser will invoke this method only once, before any other methods (except for set_document_locator() in advanced SAX handlers).

No properties are defined for this event (document is empty).

end_document(document)
Receive notification of the end of a document.

The SAX parser will invoke this method only once, and it will be the last method invoked during the parse. The parser shall not invoke this method until it has either abandoned parsing (because of an unrecoverable error) or reached the end of input.

No properties are defined for this event (document is empty).

The return value of end_document() is returned by the parser's parse() methods.
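
This makes end_document() a convenient place to hand a result back to the caller. The sketch below illustrates the idea with a stand-in for the parser: fake_parse() and the ElementCounter package are hypothetical and exist only to show the flow of the return value.

```perl
use strict;
use warnings;

# Handler that counts elements and returns the total from end_document().
package ElementCounter;

sub new            { return bless { count => 0 }, shift }
sub start_document { $_[0]{count} = 0 }
sub start_element  { $_[0]{count}++ }
sub end_element    { }
sub end_document   { return $_[0]{count} }

package main;

# Stand-in for a parser's parse() method (hypothetical): it fires a fixed
# event sequence for <a><b/></a> and returns what end_document() returns.
sub fake_parse {
    my ($handler) = @_;
    $handler->start_document({});
    for my $event ([ start_element => 'a' ], [ start_element => 'b' ],
                   [ end_element   => 'b' ], [ end_element   => 'a' ]) {
        my ($method, $name) = @$event;
        $handler->$method({ Name => $name });
    }
    return $handler->end_document({});
}

my $count = fake_parse(ElementCounter->new);
print "elements: $count\n";
```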

start_element(element)
Receive notification of the start of an element.

The SAX parser will invoke this method at the beginning of every element in the XML document; there will be a corresponding end_element() event for every start_element() event (even when the element is empty). All of the element's content will be reported, in order, before the corresponding end_element() event.

element is a hash with these properties:

Name          The element type name (including prefix).
Attributes    The attributes attached to the element, if any.

If namespace processing is turned on (which is the default), these properties are also available:

NamespaceURI  The namespace of this element.
Prefix        The namespace prefix used on this element.
LocalName     The local name of this element.

Attributes is a hash keyed by JClark namespace notation. That is, the keys are of the form "{NamespaceURI}LocalName". If the attribute has no NamespaceURI, then it is simply "{}LocalName". Each attribute is a hash with these properties:

Name          The attribute name (including prefix).
Value         The normalized value of the attribute.
NamespaceURI  The namespace of this attribute.
Prefix        The namespace prefix used on this attribute.
LocalName     The local name of this attribute.
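
To make the attribute keys concrete, here is a sketch of the data a handler might receive for one element, together with a small lookup helper. The element, the example.com namespace, and attr_key() are invented for illustration. The namespace declaration itself is left out of the attribute hash, and the NamespaceURI and Prefix entries of the unqualified attribute are omitted because the text above does not say what they contain.

```perl
use strict;
use warnings;

# Data as it might arrive in start_element() for (hypothetical input):
#   <doc:para xmlns:doc="http://example.com/doc" align="left" doc:rev="3">
my $element = {
    Name         => 'doc:para',
    NamespaceURI => 'http://example.com/doc',
    Prefix       => 'doc',
    LocalName    => 'para',
    Attributes   => {
        # No namespace: the key is "{}LocalName".
        '{}align' => {
            Name      => 'align',
            Value     => 'left',
            LocalName => 'align',
        },
        # Namespaced: the key is "{NamespaceURI}LocalName".
        '{http://example.com/doc}rev' => {
            Name         => 'doc:rev',
            Value        => '3',
            NamespaceURI => 'http://example.com/doc',
            Prefix       => 'doc',
            LocalName    => 'rev',
        },
    },
};

# Hypothetical helper that builds a key in "{NamespaceURI}LocalName" form.
sub attr_key {
    my ($ns_uri, $local_name) = @_;
    $ns_uri = '' unless defined $ns_uri;
    return "{$ns_uri}$local_name";
}

my $align = $element->{Attributes}{ attr_key(undef, 'align') }{Value};
my $rev   = $element->{Attributes}{ attr_key('http://example.com/doc', 'rev') }{Value};
print "align=$align rev=$rev\n";
```

Looking attributes up by namespace and local name, rather than by prefixed Name, keeps a handler working when a document binds a different prefix to the same namespace.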

end_element(element)
Receive notification of the end of an element.

The SAX parser will invoke this method at the end of every element in the XML document; there will be a corresponding start_element() event for every end_element() event (even when the element is empty).

element is a hash with these properties:

Name          The element type name (including prefix).

If namespace processing is turned on (which is the default), these properties are also available:

NamespaceURI  The namespace of this element.
Prefix        The namespace prefix used on this element.
LocalName     The local name of this element.
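
Since every start_element() has a matching end_element(), a handler can track its position in the document with a simple stack. The following sketch assumes nothing beyond the Name property; the Outline package and the hand-written event list (standing in for a parser reading <doc><head/><body/></doc>) are hypothetical.

```perl
use strict;
use warnings;

# Builds an indented outline of the element tree using a stack of open names.
package Outline;

sub new { return bless { stack => [], lines => [] }, shift }

sub start_element {
    my ($self, $element) = @_;
    my $depth  = scalar @{ $self->{stack} };
    my $indent = '  ' x $depth;
    push @{ $self->{lines} }, $indent . $element->{Name};
    push @{ $self->{stack} }, $element->{Name};
}

sub end_element {
    my ($self, $element) = @_;
    # The innermost open element must be the one that is ending.
    my $open = pop @{ $self->{stack} };
    $open = '(nothing)' unless defined $open;
    die "end_element for $element->{Name}, but $open is open\n"
        unless $open eq $element->{Name};
}

package main;

my $outline = Outline->new;
# Empty elements still produce both a start and an end event.
my @events = (
    [ start_element => 'doc'  ],
    [ start_element => 'head' ],
    [ end_element   => 'head' ],
    [ start_element => 'body' ],
    [ end_element   => 'body' ],
    [ end_element   => 'doc'  ],
);
for my $event (@events) {
    my ($method, $name) = @$event;
    $outline->$method({ Name => $name });
}
print "$_\n" for @{ $outline->{lines} };
```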

characters(characters)
Receive notification of character data.

The SAX parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks (however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information).

characters is a hash with this property:

Data          The characters from the XML document.
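
Because one run of text may arrive as several characters() events, a handler that needs the complete text should accumulate chunks until the enclosing element ends. A minimal sketch, assuming a hypothetical TitleCollector handler, non-nested title elements, and a simulated parser that splits the text into three chunks:

```perl
use strict;
use warnings;

# Collects the text of each <title> element, however the parser chunks it.
package TitleCollector;

sub new { return bless { buffer => undef, titles => [] }, shift }

sub start_element {
    my ($self, $element) = @_;
    # Start buffering when a title opens.
    $self->{buffer} = '' if $element->{Name} eq 'title';
}

sub characters {
    my ($self, $characters) = @_;
    # Append only while inside a title; chunks arrive in document order.
    $self->{buffer} .= $characters->{Data} if defined $self->{buffer};
}

sub end_element {
    my ($self, $element) = @_;
    return unless $element->{Name} eq 'title';
    push @{ $self->{titles} }, $self->{buffer};
    $self->{buffer} = undef;
}

package main;

my $handler = TitleCollector->new;
# Simulate a parser that reports "Perl & SAX" as three separate chunks.
$handler->start_element({ Name => 'title', Attributes => {} });
$handler->characters({ Data => $_ }) for 'Perl ', '&', ' SAX';
$handler->end_element({ Name => 'title' });

print "title: $handler->{titles}[0]\n";
```

Processing the text in end_element() rather than in characters() gives the same result whether the parser delivers one chunk or many.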

ignorable_whitespace(characters)
Receive notification of ignorable whitespace in element content.

Validating parsers must use this method to report each chunk of ignorable whitespace (see the W3C XML 1.0 recommendation, section 2.10); non-validating parsers may also use this method if they are capable of parsing and using content models.

SAX parsers may return all contiguous whitespace in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.

characters is a hash with this property:

Data          The whitespace characters from the XML document.

Exceptions

Conformant XML parsers are required to abort processing when well-formedness or validation errors occur. In Perl, SAX parsers use die() to signal these errors. To catch these errors and prevent them from killing your program, use eval{}:

    eval { $parser->parse($uri) };
    if ($@) {
        # handle error
    }

Exceptions can also be thrown when setting features or properties on the SAX parser (see advanced SAX below).

Exception values ($@) in SAX are hashes blessed into the package that defines their type, and have the following properties:

Message       A detail message for this exception.
Exception     The embedded exception, or undef if there is none.

If the exception is raised due to parse errors, these properties are also available:

ColumnNumber  The column number of the end of the text where the exception occurred.
LineNumber    The line number of the end of the text where the exception occurred.
PublicId      The public identifier of the entity where the exception occurred.
SystemId      The system identifier of the entity where the exception occurred.
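
The following sketch shows how such an exception might be raised and inspected. The My::ParseException package, failing_parse(), and the sample values are hypothetical; real parsers define their own exception packages, so check the class names your parser actually uses.

```perl
use strict;
use warnings;

# Hypothetical exception class: a hash blessed into its own package.
package My::ParseException;

sub throw {
    my ($class, %props) = @_;
    my $self = bless { Exception => undef, %props }, $class;
    die $self;
}

package main;

use Scalar::Util qw(blessed);

# Stand-in for $parser->parse($uri) hitting a well-formedness error.
sub failing_parse {
    My::ParseException->throw(
        Message      => 'mismatched end tag',
        LineNumber   => 12,
        ColumnNumber => 7,
        SystemId     => 'file:///tmp/example.xml',
        PublicId     => undef,
    );
}

my $report;
eval { failing_parse(); 1 } or do {
    my $err = $@;    # copy $@ before anything else can overwrite it
    if (blessed($err) && $err->isa('My::ParseException')) {
        $report = "$err->{Message} at line $err->{LineNumber}, "
                . "column $err->{ColumnNumber} of $err->{SystemId}";
    }
    else {
        # A plain string from die(), or an exception of another class.
        $report = "unexpected error: $err";
    }
};
print "$report\n";
```

Checking blessed() before isa() matters because $@ may also hold a plain string raised by unrelated code.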


Advanced SAX