dwww Home | Show directory contents | Find package

6. The Schema module
********************


6.1. XML Grammars
=================

There are several steps that applications must go through when they
have to use XML files:

* Make sure the XML file is well-formed.

  This is a basic step where we ensure that XML tags are correctly
  nested, that closing tags have the same names as the matching
  opening tags, that attribute values are quoted… This corresponds to
  a syntactic parser in a compiler.

  This step does not depend on the application domain. One file that
  is well-formed will always be so, no matter in what context you use
  it.

* Make sure the contents of the XML file is semantically valid.

  Depending on the application domain, we must ensure that the content
  of the file makes sense. This step is highly application dependent,
  and a file that is usable in one application might not be usable in
  another one.

  This is the phase in which the application needs to check whether a
  given XML file has all its required attributes, whether the children
  of an XML tag are the expected ones, whether the type of the
  attributes is valid…

* Use the XML file in the application.

  This is done through the already-described SAX or DOM parsers

The first phase is mandatory, and necessarily enforced by XML/Ada. You
will not be able to access the contents of the XML file if it isn’t
well-formed.

The second phase is provided by the Schema module in XML/Ada. Although
such constraints can be checked at the application level, with ad hoc
code, it is generally easier to maintain a separate file that
describes the valid semantic contents of the file, than maintain
specific code when the semantic changes. It is also difficult not to
forget special cases when doing the validating through a set of *if*
statements in the Ada code.

XML provides two ways to describe additional constraints that a file
must satisfy in order to be considered as valid.

* DTD

  The Document Type Description is the original way to do this. They
  come directly from the ancestor of XML, SGML. All XML parsers must
  parse the DTD, and report events if the user is using SAX. However,
  not all parsers are able to validate the document against a DTD
  (XML/Ada doesn’t).

  Their use tends to greatly diminish. Among their limitations are a
  limited capability to express constraints on the order of tag
  children, the fact that the DTDs themselves are written in a
  separate language totally different from XML, and that users must
  learn this language as a result.

* XML Schema

  The XML schemas are replacing the DTDs. They are written in XML, and
  provide an extensive capability to describe what the XML document
  should look like. In fact, almost all Ada types can be described in
  an XML schema, including range constraints, arrays, records, type
  inheritance, abstract types…

  It is for instance possible to indicate that the value of a
  preference, in our example, must be a string of length 6. Any other
  length will result in a validation error.


6.2. XML Schema Syntax
======================

The Schema module provides subprograms and types to parse an XML
schema and validate an XML document with this schema.

This document does not provide a full documentation on the format of
XML Schemas. This is extensive, has several obscure features, which,
although supported by XML/Ada, are of little use in most practical
usages. We refer the reader to the first part of the XML Schema
specification, which is designed as a tutorial
(http://www.w3.org/TR/xmlschema-0/).

The typical extension for a schema file is ".xsd".

A schema file must be a valid XML file, and thus start with the usual
"<?xml version="1.0" ?>" line. The root node must be named *schema*,
and belong to the namespace (http://www.w3.org/2001/XMLSchema/). The
handling of namespaces is fairly powerful, but also complex. A given
XML document might have nodes belonging to several namespaces, and
thus several schema files might have to be loaded, each defining one
of the namespaces.

In the following simple example, we will not define our schema for a
specific namespace, and thus no special attribute is needed for the
root node.  Thus, our document will be organized as:

   <?xml version="1.0" ?>
   <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
       ... rest of the description goes here ...
   </xsd:schema>

An XML schema does not enforce a specific root node in the XML
documents it validates. However, it must define all the valid elements
that can be used in the XML file. This is done through the *<element>*
tag, which takes one mandatory attribute, the name of the element we
are defining.

The contents of the element is then defined in one of two ways:

* Through a *type* attribute.

  Schemas come with a number of predefined simple types. A simple type
  is such that an element of that type accepts no child node, and that
  its contents must satisfy additional constraints (be an integer, a
  date…).

  Among the predefined simple types (which are all defined in the
  namespace http://www.w3.org/2001/XMLSchema/), one can find:
  *string*, *integer*, *byte*, *date*, *time*, *dateTime*, *boolean*…

  If no additional constraint should be enforced on this simple type
  when applied to the element, the type of the element is given
  through a *type* attribute, as in:

     <xsd:element name="tag1" type="xsd:string"  />
     <xsd:element name="tag2" type="xsd:boolean" />

  which would accept the following XML files:

     <tag1>Any string is valid here</tag1>

  and:

     <tag2>true</tag2>

  but not:

     <tag2>String</tag2>

  As will be described later, it is possible to create new types in
  XML schema, which are created with a name. Such new types can also
  be associated with the element through the *type* attribute.

* Through an inline type definition

  If the element must accept child elements, or if a further
  constraint needs to be enforced on the list of valid values, one
  must create the type. As mentioned above, this can be done by
  creating a type separately and referencing it by name, or through an
  inline type definition.

  The syntax is mostly the same in both cases. Schemas distinguish
  between the notion of simple types (that accept no child element)
  and complex types (that accept child elements, and possibly text
  value).

  To define a simple type, based on string, but that only allows a
  limited set of values (similar to an Ada enumeration), one would
  create a restriction of the standard string type, as in:

     <xsd:element name="tag3">
      <xsd:simpleType>
        <xsd:restriction base="xsd:string">
          <xsd:enumeration value="value1" />
          <xsd:enumeration value="value2" />
        </xsd:restriction>
      </xsd:simpleType>
     </xsd:element>

  Similarly, we could create an integer type whose valid range of
  values is between 10 and 20, as in:

     <xsd:element name="tag4">
      <xsd:simpleType>
        <xsd:restriction base="xsd:byte">
          <xsd:minInclusive value="10" />
          <xsd:maxInclusive value="20" />
        </xsd:restriction>
      </xsd:simpleType>
     </xsd:element>

  Complex types allow elements to have child nodes, as well as
  attributes. The list of valid attributes is created by a set of
  *<xsd:attribute>* tags, and the list of valid child nodes is
  generally defined either through a *<xsd:choice>* or a
  *<xsd:sequence>* node (although it is possible to indicate that any
  child node is authorized, among other things).

  *<xsd:choice>* indicates the children can appear in any order,
  whereas *<xsd:sequence>* enforces a specific order on children.

  In both cases, extra attributes can be specified to indicate the
  number of times the sequence or choice itself can be repeated, or
  that each child node can appear.

  For instance, we can indicate that *tag5* accepts between 1 and 4
  child nodes, chosen among *tag6* and *tag7*, but that the latter, if
  present, can only appear once. In addition, tag5 accepts one
  optional attribute. Note that the type of tag6 and tag7 is here
  specified through a *type* attribute, although it could in turn be
  defined inline:

     <xsd:element name="tag5">
       <xsd:complexType>
         <xsd:choice>
           <xsd:element name="tag6" type="xsd:string"
                        minOccurs="1" maxOccurs="3/>
           <xsd:element name="tag7" type="xsd:string" maxOccurs="1" />
         </xsd:choice>
         <xsd:attribute name="attr" type="xsd:boolean" use="optional" />
       </xsd:complexType>
     </xsd:element>

  In the example above, if *tag6* was defined elsewhere in the schema,
  we could use a reference to it, instead of duplicating its type
  definition, as in:

     <xsd:element ref="tag6" />

  If you need an element with no child element (just a string value),
  but that accepts attributes, this also must be defined through a
  complex type, as in:

     <xsd:element name="tag8" />
       <xsd:complexType>
         <xsd:simpleContent>
           <xsd:extension base="xsd:string">
              <xsd:attribute name="attr" type="xsd:boolean" />
           </xsd:extension>
         </xsd:simpleContent>
       </xsd:complexType>
     </xsd:element>

As mentioned before, instead of defining inline types, we could
explicitly declare them, and reference them in the element declaration
later on:

   <xsd:simpleType name="string_of_length_10">
     <xsd:restriction base="xsd:string" />
       <xsd:length value="10"/>
     </xsd:restriction>
   </xsd:simpleType>
   <xsd:element name="tag9" type="string_of_length_10" />


6.3. Connecting XML documents and schemas
=========================================

There are several ways that XML/Ada uses to find what schema to use
when validating a file.

* Manually creating the grammar.

  The schema module contains the package *Schema.Validators* which
  allows you to create a grammar by hand. It is very low-level, and it
  is likely that you will never need to use it. It is used internally
  mostly, and when creating the schema which is used to validate
  schema files themselves.

* Explicitly parsing a schema file

  Parsing a schema file can be done through a call to parse for a
  reader derived from *Schema.Schema_Readers.Schema_reader*.  As
  usual, you call *Parse*, and pass it an input source. As output, you
  get access to a grammar, that can then be given to another instance
  of a *Schema.Readers.Validating_Reader*.

  This technique will generally be used when you need to validate
  several XML files with the same grammar: you parse the grammar only
  once, and then reuse its instance, instead of reparsing the ".xsd"
  file every time:

     with Ada.Text_IO;   use Ada.Text_IO;
     with Schema.Schema_Readers, Schema.Validators, Input_Sources.File;
     use  Schema.Schema_Readers, Schema.Validators, Input_Sources.File;

     procedure SchemaExample2 is
        Grammar : XML_Grammar;
        Schema  : Schema_Reader;
        Read    : File_Input;
     begin
        Open ("file.xsd", Read);
        Parse (Schema, Read);
        Close (Read);

        Grammar := Get_Grammar (Schema);

      exception
         when XML_Validation_Error | XML_Not_Implemented =>
            Put_Line ("ERROR: " & Get_Error_Message (Schema));
     end SchemaExample2;

  In the example above, the schema file itself is validated against
  the official schema for schema files.

  The resulting grammar object is in fact a collection of parsed
  schema files, each associated with its own namespace. It can be kept
  as long as you need it in your application. Memory will
  automatically be reclaimed when no longer needed.

  Every time you parse an XML file later on, you must associate the
  Grammar with the parser:

     declare
        Read      : File_Input;
        My_Reader : Validating_Reader;
     begin
        Set_Grammar (My_Reader, Grammar);
        Set_Feature (My_Reader, Schema_Validation_Feature, True);
        Open (Xml_File.all, Read);
        Parse (My_Reader, Read);
        Close (Read);

     exception
        when XML_Validation_Error | XML_Not_Implemented =>
           Put_Line ("ERROR: " & Get_Error_Message (My_reader));
     end;

* Implicitly parsing the schema

  Two special attributes, defined in the Schema standard, can be used
  to indicate, in an XML document itself, that it should be validated
  with a specific schema.

  These attributes are both defined in a special namespace,
  http://www.w3.org/2001/XMLSchema-instance.

  * *xsi:noNamespaceSchemaLocation*

    The value of this attribute is the name of a file that contains
    the schema to use for elements that are not associated with a
    specific namespace.

  * *xsi:schemaLocation*

    This attribute is a list of strings, alternatively the prefix of a
    namespace and the name of an xsd file to use for that namespace.
    For instance, ""ns1 file1.xsd ns2 file2.xsd"".

  When it encounters any of these two attributes, XML/Ada will
  automatically parse the corresponding schema files, and use the
  result to validate the file.

  See the section below on optimizing the parsing of the grammars, as
  a way to avoid parsing the same grammar multiple times.


6.4. Validating documents with SAX
==================================

XML/Ada is quite unique in the category of XML parsers, since it
allows the validation of XML files when you are using an event-based
parser with SAX. Most other XML parsers only work on DOM trees.

Basing the validation on SAX is more efficient, since there is no need
to read the whole XML stream (or even the grammar) in memory before
starting the validation, and errors can be reported immediately.

It also requires less memory to run, and thus can validate large XML
documents.

It also means that even if you are using SAX, and not DOM, you still
have access to the validation features.

Validating an XML document while parsing it is basically done the same
as when using SAX itself. Instead of inheriting from
*Sax.Readers.Reader*, your tagged type must inherit from
*Schema.Readers.Validating_Reader*.

As usual, you can still override the predefined primitive operations
like *Start_Element*, *End_Element*…

Note the activation of the *Schema_Validation_Feature* feature,
without which no validation takes place:

   --
   --  Copyright (C) 2017, AdaCore
   --

   with Ada.Text_IO;        use Ada.Text_IO;
   with Sax.Readers;        use Sax.Readers;
   with Schema.Readers;     use Schema.Readers;
   with Schema.Validators;
   with Input_Sources.File; use Input_Sources.File;

   procedure SchemaExample is
      Input : File_Input;
      My_Reader : Validating_Reader;
   begin
      Set_Public_Id (Input, "Preferences file");
      Open ("pref.xml", Input);

      Set_Feature (My_Reader, Schema_Validation_Feature, True);
      Parse (My_Reader, Input);

      Close (Input);

   exception
      when Schema.Validators.XML_Validation_Error =>
         Put_Line ("ERROR: " & Get_Error_Message (My_Reader));
   end SchemaExample;


6.5. Validating documents with DOM
==================================

This is very similar to using DOM itself, except the base class of
your reader should be *Schema.Dom_Readers.Tree_Reader*. Going back to
the example described in Using DOM, you would use the following to
validate XML streams before generating the DOM tree.

   --
   --  Copyright (C) 2017, AdaCore
   --

   with Input_Sources.File; use Input_Sources.File;
   with Sax.Readers;        use Sax.Readers;
   with DOM.Core;           use DOM.Core;
   with Schema.Dom_Readers; use Schema.Dom_Readers;

   procedure DomSchemaExample is
      Input  : File_Input;
      Reader : Schema.Dom_Readers.Tree_Reader;
      Doc    : Document;
   begin
      Set_Public_Id (Input, "Preferences file");
      Open ("pref_with_xsd.xml", Input);

      Set_Feature (Reader, Validation_Feature, False);

      Parse (Reader, Input);
      Close (Input);

      Doc := Get_Tree (Reader);

      Free (Reader);
   end DomSchemaExample;


6.6. Unsupported schema elements
================================

Not all aspects of XML schemas are supported by XML/Ada. In
particular, it does not currently support XPath, so any part of the
schema that is related to XPath expressions (for instance *<xsd:key>*
and *<xsd:unique>*) are not supported currently.


6.7. Optimizing the parsing of grammars
=======================================

It is often the case that a given ".xsd" file will be reused multiple
times to validate XML documents. In such case, you do not want to
parse the file multiple times, but instead reuse an already existing
*XML_Grammar* object. Of course, this is a tradeoff between memory
used to keep the grammar in memory, and the time it would take to
reparse the grammar.

This is easily done when you have a single ".xsd" file to reuse for
all the XML files. Simply call *Set_Grammar* on the parser before you
parse the file, as in:

   declare
      G : constant XML_Grammar := ...;  --  parsed earlier
      R : Validating_Reader;
      F : File_Input;
   begin
      R.Set_Grammar (G);
      Open ("file.xml", F);
      R.Parse (F);
      Close (F);
      ...;  --  Do something with the resulting tree
   end;

The second use case is a bit more complex: you have several XSD files
to parse, and the XML files will need either of these. If you are
using namespaces, there is nothing special to do, and the same code as
above applies: you can simply parse each of the XSD files into the
same XML_Grammar, and then use that grammar to parse all the XML
files, as in:

   declare
      G : XML_Grammar;
      S : Schema_Reader;
      F : File_Input;
      R : Validating_Reader;
   begin
      Open ("grammar1.xsd", F);
      S.Parse (F);
      F.Close;

      Open ("grammar2.xsd", F);
      S.Parse (F);
      F.Close;

      G := S.Get_Grammar;

      R.Set_Grammar (G);
      Open ("file.xml", F);
      R.Parse (F);
      F.Close;
   end;

If however you are not using namespaces, you cannot use this
technique, since the grammar from the various XSD files would end up
mixed up, and validation will most likely fail. So instead you need to
have one *XML_Grammar* per XSD file, and then set the grammar on the
reader dynamically. A full example is given in the XML/Ada source
distribution, in "tests/schema/multiple_xsd". Here is an overview.

We first need to parse each of the XSD files into its own grammar:

   declare
      Symbols : Symbol_Table;
      G1, G2 : XML_Grammar;
      S      : Schema_Grammar;
      F      : File_Input;
   begin
      --  Since we are going to reuse grammars, we need to ensure their
      --  symbol tables (where internal strings are stored) across all
      --  involved parsers).
      Symbols := Allocate;
      S.Set_Symbol_Table (Symbols);

      --  Now we can parse each of the XSD file
      Open ("algo1.xsd", F);
      S.Parse (F);
      F.Close;
      G1 := S.Get_Grammar;

      S.Set_Grammar (No_Grammar);  --  reset
      Open ("algo2.xsd", F);
      S.Parse (F);
      F.Close;
      G2 := S.Get_Grammar;
   end;

We then need to create a custom validating reader, which knows how to
set the grammar based on its name. This is done by overriding one of
the primitive operations of the parser:

   declare
      type My_Reader is new Validating_Reader with null record;
      overriding procedure Parse_Grammar
         (Self          : not null access Reader_With_Preloaded_XSD;
          URI, Xsd_File : Sax.Symbols.Symbol;
          Do_Create_NFA : Boolean := True) is
      begin
         if Xsd_File = "algo1.xsd" then
            Self.Set_Grammar (G1);
         elsif Xsd_File = "algo2.xsd" then
            Self.Set_Grammar (G2);
         end if;
      end Parse_Grammar;

      R : My_Reader;
   begin
      --  Also share the same symbol table
      R.Set_Symbol_Table (Symbols);
      R.Set_Feature (Schema_Validation_Feature, True);

      Open ("test1.xml", F);
      R.Parse (F);
      F.Close;
   end;

Where for instance "test1.xml" contains:

   <?xml version="1.0" encoding="UTF-8"?>
   <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="algo1.xsd">
      <child>102</child>
   </root>

Generated by dwww version 1.15 on Tue Jul 2 00:35:35 CEST 2024.