OpenOffice.org XML Engine Upgraded

The XForms capability within OpenOffice.org is one of my special interests and I can still see a lot of potential for this feature. However, when I first explored XForms I came across some limitations that all seemed to come down to the fact that only the XPath 1.0 standard was implemented within OpenOffice.org. So what updates are being made, and what will they offer?

Currently, the older Xalan XSLT processor is being used within the OpenOffice.org code base. In the upcoming 3.x release (most likely 3.1) it will be replaced with the more streamlined and feature-rich Saxon XSLT processor. Saxon implements not only the newer XPath 2.0 and XSLT 2.0 standards, but also the new XQuery 1.0 standard. These offer newer features that can be built into OpenOffice.org. So which features of the new processor would be of use here?

For Loops and distinct-values

The new for expression is a powerful tool for traversing data and applying transformations or functions to particular items within it. It becomes more powerful still when combined with the new distinct-values() function, which lets you traverse the data but work only with the distinct values in the sequence rather than applying the transform to every item.

As an example, let's check how many cells each row of a particular table (defined in HTML) contains and output an indicator depending on whether the count is over 3. The following XPath returns "many" or "few" for each row:

       for $row in //tr
       return
         if (count($row/td) > 3)
         then "many"
         else "few"
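
The example above does not use the distinct feature. In XPath 2.0 the function that provides it is distinct-values(). As a minimal sketch, assuming a /library/book data set in which each <book> has a hypothetical <author> child, the following visits each author once and reports how many books that author has:

       for $author in distinct-values(/library/book/author)
       return concat($author, ": ", count(/library/book[author = $author]))

Without distinct-values(), an author with three books would be reported three times.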

Comments

XPath 2.0 allows the user to add comments using the (: and :) delimiters and yes, they can be nested. For example:

(: Check for at least one book with the name Office :)

some $book in /books/book/name satisfies $book = "Office"
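
Nesting is mostly useful when you want to comment out a passage that already contains comments. A small sketch reusing the expression above:

       (: Outer comment
          (: inner comment, nested inside the outer one :)
          the outer comment only ends here :)
       some $book in /books/book/name satisfies $book = "Office"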

Node Grouping

Along with the new for loops, XSLT 2.0 also offers the for-each-group instruction, which allows the user to solve an old problem that existed in XSLT 1.0. In earlier versions it was hard to rearrange the structure of an XML document so that its elements were grouped by some other element or attribute. Now, with for-each-group, you can quite simply create a transform that extracts the elements into groupings that you define.
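
As an illustration, here is a minimal stylesheet sketch that groups books by author. The /library/book structure and the <author> and <name> children are assumptions made for the example, not something OpenOffice.org defines:

       <xsl:stylesheet version="2.0"
           xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

         <xsl:template match="/">
           <authors>
             <!-- One group per distinct author value -->
             <xsl:for-each-group select="/library/book" group-by="author">
               <author name="{current-grouping-key()}">
                 <!-- current-group() holds every book in this group -->
                 <xsl:copy-of select="current-group()/name"/>
               </author>
             </xsl:for-each-group>
           </authors>
         </xsl:template>

       </xsl:stylesheet>

Inside the instruction, current-grouping-key() returns the value the group was formed on and current-group() returns the members of the group.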

Aggregation Functions

Aggregation functions such as min(), max() and avg() are new in XPath 2.0, joining the sum() and count() functions carried over from XPath 1.0, and all of them work with the new features of XSLT 2.0 such as the groupings that were just mentioned. Many of the older functions from XSLT 1.0 have likewise been updated to handle the newer features of XSLT 2.0.
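
As a rough sketch, assuming each <book> under /library has hypothetical <author> and <price> children, the first three expressions aggregate over the whole data set and the last one computes an average per author. Each expression stands on its own:

       min(/library/book/price)
       max(/library/book/price)
       avg(/library/book/price)

       for $author in distinct-values(/library/book/author)
       return avg(/library/book[author = $author]/price)

Inside an XSLT 2.0 for-each-group instruction the same per-group average could be written as avg(current-group()/price).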

New Output Capabilities

One of the major features of XSLT 2.0 that will be made available is the ability to output files in XHTML format. Currently output is limited to XML, HTML and text, but version 2.0 adds the more up-to-date XHTML output method as well.

Another benefit of XSLT 2.0 is multiple document output. In XSLT 1.0 the processor can only output one document per run, but with XSLT 2.0 it is now possible to produce multiple documents with a single run through a stylesheet. This means that you no longer need to use batch processing or specialised scripting to produce different documents. The XSLT 2.0 standard also offers a similar benefit on input: you can read multiple documents, and they can even be in non-XML formats.
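
Both output features can be sketched in one small stylesheet. The /library/book input, the <name> child and the book-N.xhtml file names are assumptions made for illustration. Each book is written to its own XHTML file through the xsl:result-document instruction:

       <xsl:stylesheet version="2.0"
           xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

         <!-- Named output definition that selects the XHTML method -->
         <xsl:output name="page" method="xhtml" indent="yes"/>

         <xsl:template match="/">
           <xsl:for-each select="/library/book">
             <!-- One result document per book: book-1.xhtml, book-2.xhtml, ... -->
             <xsl:result-document href="book-{position()}.xhtml" format="page">
               <html xmlns="http://www.w3.org/1999/xhtml">
                 <head><title><xsl:value-of select="name"/></title></head>
                 <body><h1><xsl:value-of select="name"/></h1></body>
               </html>
             </xsl:result-document>
           </xsl:for-each>
         </xsl:template>

       </xsl:stylesheet>

On the input side, the XSLT 2.0 function unparsed-text() reads a non-XML file as a string.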

New Functions and Operators

XPath 2.0 now has support for a number of new functions and operators. Some of the more powerful ones include:

  • intersect – returns the nodes that appear in both of two node sequences, a simple way to check if a node is contained within a node-set
  • every – checks if every (compared to any) node satisfies a given condition
  • except – all nodes of one sequence except those that also appear in another
  • plus more

Here are some examples of the newer functionality and how it makes things easier. Let's assume the variable $books contains a sequence of <book> elements. You could find all books in the sequence that are also among the books in our data document using the intersect operator, which compares nodes by identity rather than by content:

       $books intersect /library/book

We can also find the difference between the two sequences using the except operator. This will return all books in $books that are not in our /library/book data:

       $books except /library/book
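
The every quantifier from the list above works in the same way as the some expression shown earlier. A minimal sketch, assuming each <book> has a hypothetical <price> child:

       every $book in /library/book satisfies $book/price > 10

This returns true only if all books cost more than 10. Replacing every with some would return true as soon as one book does.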

Some of the new functions available with the new XSLT 2.0 processor will make certain tasks much easier, particularly where string processing is concerned. The new functions included in XPath 2.0 are upper-case, lower-case, matches, replace, tokenize and string-join. These allow greater flexibility in how data is handled and how it can be compared.
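
A few of these functions in use, with literal strings chosen purely for illustration. Each line is a separate expression, followed by its result in an XPath comment:

       upper-case("office")                      (: "OFFICE" :)
       lower-case("SAXON")                       (: "saxon" :)
       tokenize("writer,calc,impress", ",")      (: ("writer", "calc", "impress") :)
       replace("XSLT 1.0", "1\.0", "2.0")        (: "XSLT 2.0" :)
       matches("OpenOffice.org", "office", "i")  (: true, the "i" flag ignores case :)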

Regular Expressions

One important point to note about these new string functions is that several of them, such as matches, replace and tokenize, can now use regular expressions to define patterns that match the text more accurately and more easily.

Using regular expression patterns you can define particular text syntax for data input and presentation. For example, I can use \d{2}-\d{4}-\d{4} to match Australian phone numbers of the form 02-9876-1234. These regular expressions are quite powerful and offer advantages such as being able to match text in a node to a particular pattern you want to find, rather than specific words.
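
A short sketch of the phone number pattern in use, with sample strings made up for illustration. Note that matches() succeeds if the pattern is found anywhere in the string, so the ^ and $ anchors are needed to test the whole value:

       matches("call 02-9876-1234 now", "\d{2}-\d{4}-\d{4}")   (: true, found inside the text :)
       matches("02-9876-1234", "^\d{2}-\d{4}-\d{4}$")          (: true, the whole value fits :)
       matches("9876-1234", "^\d{2}-\d{4}-\d{4}$")             (: false, area code missing :)

       replace("02-9876-1234", "^(\d{2})-(\d{4})-(\d{4})$", "($1) $2 $3")
                                                               (: "(02) 9876 1234" :)

In the replacement string, $1, $2 and $3 refer to the parenthesised groups of the pattern.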

Summary

You can see that there are lots of different features coming with XSLT 2.0 and XPath 2.0, and how much more power and ease of use they offer to the developer. This means more efficient extensions to OpenOffice.org in the future, as well as better compatibility and import/export options.

The new XQuery 1.0 features are not covered here; I will leave them to another article. Even so, I see a lot of benefits coming from these newer standards being made available in OpenOffice.org in the near future. They will not bring a major change in features to the user immediately, but as developers take up the new standards and find them easier to use and more powerful, new capabilities will start to filter into the application, either directly or as extensions.

I am looking forward to using the new Saxon processor, and I hope that other developers will see this as an opportunity to start pushing the power of OpenOffice.org and how it treats documents and data in general, so that we can see the new features being implemented sooner rather than later.