Advanced Guide

This guide presents some of the more advanced features supported by declxml that can be useful when they are needed.

XPath Syntax

declxml supports a very small subset of XPath syntax that enables greater expressiveness when defining processors.

The Dot (.) Selector

The dot (.) selector can be used in a processor to refer to the parent processor’s element. For instance, the dot operator can be used to refer to attributes on a childless element.

>>> import declxml as xml
>>> from pprint import pprint

>>> books_xml = """
... <books>
...     <book title="I, Robot" author="Isaac Asimov" />
...     <book title="Foundation" author="Isaac Asimov" />
...     <book title="Nemesis" author="Isaac Asimov" />
... </books>
... """

>>> books_processor = xml.array(xml.dictionary('book', [
...     xml.string('.', attribute='title'),  # Select the attribute "title" on the element "book"
...     xml.string('.', attribute='author'),
... ]), nested='books')


>>> pprint(xml.parse_from_string(books_processor, books_xml))
[{'author': 'Isaac Asimov', 'title': 'I, Robot'},
 {'author': 'Isaac Asimov', 'title': 'Foundation'},
 {'author': 'Isaac Asimov', 'title': 'Nemesis'}]

The dot operator can also be used to group an element’s attribute values with the values of the element’s children.

>>> import declxml as xml
>>> from pprint import pprint

>>> author_xml = """
... <author name="Liu Cixin">
...     <book>The Three Body Problem</book>
...     <book>The Dark Forest</book>
...     <book>Deaths End</book>
... </author>
... """

>>> author_processor = xml.dictionary('author', [
...     xml.string('.', attribute='name'),
...     xml.array(xml.string('book'), alias='books')
... ])

>>> pprint(xml.parse_from_string(author_processor, author_xml))
{'books': ['The Three Body Problem', 'The Dark Forest', 'Deaths End'],
 'name': 'Liu Cixin'}

The Path (/) Selector

The path selector (/) can be used to select descendant elements which can be useful for flattening out deeply nested XML data.

>>> import declxml as xml
>>> from pprint import pprint

>>> hugo_xml = """
... <awards>
...     <hugo>
...         <winners>
...             <winner>
...                 <year>2017</year>
...                 <book>
...                     <title>The Obelisk Gate</title>
...                     <author>N. K. Jemisin</author>
...                 </book>
...             </winner>
...             <winner>
...                 <year>2016</year>
...                 <book>
...                     <title>The Fifth Season</title>
...                     <author> N.K. Jemisin</author>
...                 </book>
...             </winner>
...             <winner>
...                 <year>2015</year>
...                 <book>
...                     <title>The Three Body Problem</title>
...                     <author>Liu Cixin</author>
...                 </book>
...             </winner>
...         </winners>
...     </hugo>
... </awards>
... """

>>> hugo_processor = xml.array(xml.dictionary('winner', [
...     xml.integer('year'),
...     xml.string('book/title', alias='title'),
...     xml.string('book/author', alias='author'),
... ]), nested='awards/hugo/winners')

>>> pprint(xml.parse_from_string(hugo_processor, hugo_xml))
[{'author': 'N. K. Jemisin', 'title': 'The Obelisk Gate', 'year': 2017},
 {'author': 'N.K. Jemisin', 'title': 'The Fifth Season', 'year': 2016},
 {'author': 'Liu Cixin', 'title': 'The Three Body Problem', 'year': 2015}]

The data will be serialized back into the deeply nested XML structure if the processor is used to perform serialization.

It is highly recommended to provide aliases when using XPath syntax to ensure that when a value is parsed and assigned a name (e.g. a field of a dictionary, object, or namedtuple), the name of the value is a valid Python identifier without any ‘.’ or ‘/’ characters.

Hooks

Hooks are an advanced feature that allow arbitrary code to be executed during the parsing and serialization process. A Hooks object is associated with a processor and contains two functions: after_parse and before_serialize.

Both of these functions (which can be any callable object) are provided two parameters and should return a single value. The first parameter provided to both functions is a ProcessorStateView object which contains information about the current state of the processor when the function is invoked.

The after_parse function is invoked after a processor has parsed a value from the XML data. The second parameter provided to the after_parse function is the value parsed by the processor from the XML data. The after_parse function must return a single value which will be used by the processor as its parse result. The value returned by after_parse replaces the value parsed from the XML data as the processor’s parse result.

The before_serialize function is invoked before a processor serializes a value to XML. The second parameter provided to the before_serialize function is the value to be serialized by the processor to XML. The before_serialize function must return a single value which the processor will serialize to XML. The value returned by before_serialize replaces the value provided to the processor to serialize to XML.

There are three intended use cases for hooks (though since hooks can be any arbitrary callable objects, there should be flexibility for other use cases):

  • Value transformations
  • Validation
  • Debugging

Value Transformations

Sometimes it is useful to be able to transform values read from XML during parsing into a shape more convenient for the application to use and transform values during serialization back into shapes that better fit the XML structure.

Hooks can be used to achieve this by simply returning the transformed value from the after_parse and before_serialize functions. This works because whatever value a processor was going to use for parsing or serialization is replaced by the value returned by after_parse or before_serialize.

As a basic example, if we want to make sure all strings read from an XML document are uppercase when used in our application and lowercase when written to XML, we could use hooks to perform value transformations.

>>> import declxml as xml

>>> hooks = xml.Hooks(
...     after_parse=lambda _, x: x.upper(),
...     before_serialize=lambda _, x: x.lower()
... )

>>> processor = xml.dictionary('data', [
...     xml.string('message', hooks=hooks),
... ])

>>> xml_string = """
... <data>
...    <message>hello</message>
... </data>
... """

>>> xml.parse_from_string(processor, xml_string)
{'message': 'HELLO'}

>>> data = {'message': 'GOODBYE'}
>>> print(xml.serialize_to_string(processor, data, indent='    '))
<?xml version="1.0" encoding="utf-8"?>
<data>
    <message>goodbye</message>
</data>

When using hooks to perform value transformations, it is a good idea to ensure that the transformations performed by after_parse and before_serialize are inverse operations of each other so that parsing and serialization work correctly when using transformed values. This is particularly important when values are transformed into different types.

Validation

By default, declxml only performs a very basic level of validation by ensuring that required values are present and that they are of the correct type. Hooks provide the ability to perform additional, application-specific validation.

When performing validation, we can use the ProcessorStateView object provided as the first parameter to the after_parse and before_serialize functions. The ProcessorStateView object provides a useful method, raise_error, for reporting errors. This method will raise an application-provided exception with a custom error message and will include information about the current state of the processor in the error message.

For example, if we want to ensure that integer values are in a specific range, we could use hooks to perform the validation.

>>> import declxml as xml

>>> def validate(state, value):
...     if value not in range(1, 4):
...         state.raise_error(
...             RuntimeError,
...             'Invalid value {}'.format(value)
...         )
...
...     # Important! Don't forget to return the value
...     return value

>>> hooks = xml.Hooks(
...     after_parse=validate,
...     before_serialize=validate
... )

>>> processor = xml.dictionary('data', [
...     xml.integer('value', hooks=hooks),
... ])

>>> xml_string = """
... <data>
...     <value>567</value>
... </data>
... """

>>> xml.parse_from_string(processor, xml_string)
Traceback (most recent call last):
...
RuntimeError: Invalid value 567 at data/value

>>> data = {'value': -90}
>>> xml.serialize_to_string(processor, data)
Traceback (most recent call last):
...
RuntimeError: Invalid value -90 at data/value

When using hooks for validation, it is important to remember to return the value from the before_parse and after_serialize functions since the processor will used the value returned by those functions as its parsing result and the value to serialize to XML, respectively.

Debugging

Hooks can also be used to debug processors. We can use the ProcessorStateView object provided to the before_parse and after_serialize functions to include information about which values are received in which locations in the XML document.

>>> import declxml as xml

>>> def trace(state, value):
...     print('Got {} at {}'.format(value, state))
...
...     # Important! Don't forget to return the value
...     return value

>>> hooks = xml.Hooks(
...     after_parse=trace,
...     before_serialize=trace
... )

>>> processor = xml.dictionary('data', [
...     xml.integer('value', hooks=hooks),
... ], hooks=hooks)

>>> xml_string = """
... <data>
...     <value>42</value>
... </data>
... """

>>> xml.parse_from_string(processor, xml_string)
Got 42 at data/value
Got {'value': 42} at data
{'value': 42}

>>> data = {'value': 17}
>>> print(xml.serialize_to_string(processor, data, indent='    '))
Got {'value': 17} at data
Got 17 at data/value
<?xml version="1.0" encoding="utf-8"?>
<data>
    <value>17</value>
</data>