Wednesday, January 30, 2008

Content Re-Use in XMLSpy with OOXML and XSLT

The following story originally appeared on the XML Aficionado blog of Altova CEO Alexander Falk. It has also been published as an Altova Tech Note.

While Open XML may not yet be an ISO standard, it has already been standardized by Ecma and, even more importantly, all documents created by Office 2007 are stored in Open XML by default. There is therefore an abundance of documents whose content you can now reuse much more easily and productively than ever before. So instead of waiting for the ISO vote or paying too much attention to all the political battles being fought around it, I want to show you how you can take advantage of Open XML (sometimes also called OOXML or Office Open XML) today.

This is the first article in a series of blog postings that I plan to write about practical Open XML tips & tricks, so I encourage you to subscribe to my XML Aficionado blog (via RSS or via e-mail), if you haven't already done so. This will ensure that you get future articles from this series automatically as soon as I post them.

So let's look at an Open XML document in our favorite XML editor. For this example I am going to use a WordprocessingML document (.docx) that I have created with Microsoft Office Word 2007. When I open the .docx file in XMLSpy, I immediately get to see the contents of the package file, which is structured according to the Open Packaging Conventions.

That's a fancy way of saying that it is a ZIP file that contains specific files and directories that make up the content, structure, styles, relationships, and other parts of the document. Using XMLSpy's built-in capability to open any ZIP-formatted archive, I can directly browse any directory structures inside the ZIP package, add new files to the package, or open any existing XML file contained in the package:

[Figure OOXML1: the contents of the .docx package shown in XMLSpy]

For the purpose of reusing the content from this WordprocessingML example file, I am going to open the 'document.xml' file, which contains the content of the document.

As soon as I double-click the file in the ZIP archive, the XML is displayed in a separate window just like any other XML document and I can use the powerful grid view or text view features of XMLSpy to view or edit the XML data (sometimes it may be useful to invoke the pretty-print function in text view to make the file more easily readable):

[Figure OOXML2: 'document.xml' opened in XMLSpy]

This is, of course, a live editing view, so you can not only view the Open XML data, but make any changes to the XML and save it back into the package file.

But now let's look at how we can easily reuse content from this Open XML document using XSLT. XMLSpy ships with a few Open XML example documents as well as example XSLT stylesheets for just that purpose. Let's look at the 'docx2html.xslt' stylesheet, which takes a WordprocessingML document and extracts all paragraphs to turn them into HTML. This example stylesheet is by no means intended to be a full-featured conversion tool from .docx to HTML. Instead, it serves as a blueprint for reusing content from a .docx file and will hopefully be a starting point for your own stylesheet development efforts.

At the core of that XSLT stylesheet we need an <xsl:for-each> loop to iterate over all the <docx:p> elements, each of which the stylesheet turns into a simple HTML <p> paragraph. The text inside a paragraph is grouped into runs of characters that share common attributes, so we need an inner <xsl:for-each> loop to iterate over those <docx:r> elements and extract the text from their <docx:t> child elements. Thus the most primitive content reuse that only extracts the text of all paragraphs looks like this:

[Figure XSLT1: the nested <xsl:for-each> loops]
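A minimal sketch of those two nested loops could look like the following. It is not the 'docx2html.xslt' file that ships with XMLSpy; it assumes that the `docx` prefix is bound to the WordprocessingML main namespace and that the stylesheet is applied directly to 'document.xml'.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:docx="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    exclude-result-prefixes="docx">

  <xsl:output method="html" indent="yes"/>

  <xsl:template match="/">
    <html>
      <body>
        <!-- Outer loop: one HTML <p> for every WordprocessingML paragraph -->
        <xsl:for-each select="//docx:p">
          <p>
            <!-- Inner loop: append the text of every run in the paragraph -->
            <xsl:for-each select="docx:r">
              <!-- separator="" joins multiple <docx:t> children without spaces -->
              <xsl:value-of select="docx:t" separator=""/>
            </xsl:for-each>
          </p>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

Given an input paragraph such as `<docx:p><docx:r><docx:t>Hello </docx:t></docx:r><docx:r><docx:t>world</docx:t></docx:r></docx:p>`, the two runs are concatenated into a single HTML paragraph.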

Once we have constructed those loops, we can start to think about extracting and reusing some style information. To do that, we now emit a <span> HTML element for every <docx:r> run of characters and give it a style attribute whose value depends on the run's <docx:rPr> (run properties) element. We use <xsl:apply-templates> on that element to decide which HTML style to apply to each <span>:

[Figure XSLT2: the loops extended with <span> elements and <xsl:apply-templates>]
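One way to sketch that step is shown here, under the same assumptions as before (the `docx` prefix bound to the WordprocessingML main namespace, 'document.xml' as input). The style value is built by whatever templates match the children of <docx:rPr>; without such templates the attribute simply stays empty.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:docx="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    exclude-result-prefixes="docx">

  <xsl:output method="html" indent="yes"/>

  <xsl:template match="/">
    <html>
      <body>
        <xsl:for-each select="//docx:p">
          <p>
            <xsl:for-each select="docx:r">
              <!-- One <span> per run of characters -->
              <span>
                <!-- The matching templates contribute CSS declarations as text -->
                <xsl:attribute name="style">
                  <xsl:apply-templates select="docx:rPr"/>
                </xsl:attribute>
                <xsl:value-of select="docx:t" separator=""/>
              </span>
            </xsl:for-each>
          </p>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

Delegating to <xsl:apply-templates> keeps the loop unchanged when more run properties are supported later: each new property only needs its own template.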

The corresponding templates for the three most common styles (bold, italic, underline) are trivially easy to construct and look like this:

[Figure XSLT3: the templates for bold, italic, and underline]
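As an illustration, such templates might be written as follows. This is a sketch, not the shipped example: the CSS declarations are my own choice, and the predicates that skip switched-off properties (for example a bold element whose `val` attribute is "0") are an addition that the simplest version would omit. The templates are meant to sit next to the main template that emits the <span> elements.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:docx="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    exclude-result-prefixes="docx">

  <!-- Bold: <docx:b/> unless explicitly switched off -->
  <xsl:template match="docx:b[not(@docx:val = ('0', 'false', 'off'))]">
    <xsl:text>font-weight:bold;</xsl:text>
  </xsl:template>

  <!-- Italic: <docx:i/> unless explicitly switched off -->
  <xsl:template match="docx:i[not(@docx:val = ('0', 'false', 'off'))]">
    <xsl:text>font-style:italic;</xsl:text>
  </xsl:template>

  <!-- Underline: any underline type other than "none" -->
  <xsl:template match="docx:u[not(@docx:val = 'none')]">
    <xsl:text>text-decoration:underline;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Run properties without a matching template fall through to the built-in rules, which emit nothing for these childless elements, so unsupported formatting is silently ignored.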

With just a few lines of XSLT and a few templates we have already written a stylesheet that extracts the basic paragraphs and most important styles from a WordprocessingML document and turns them into HTML that can be viewed in the browser view. Here is the result produced by running the above XSLT stylesheet on the example WordprocessingML document that you can find in the XMLSpy examples directory:

[Figure OOXML4: the HTML result in the browser view]

Similarly, it is quite easy to extend the stylesheet to extract meta information, other styles, or image information from the WordprocessingML document and reuse the content for any modern application scenario, from web publishing via HTML, RSS, or social media formats to mobile web applications and beyond.

"But wait! How can I apply an XSLT stylesheet to an XML document that is stored within a ZIP file?", you might ask.

You can, of course, extract all the XML files using a regular ZIP expander, but there is a much better solution. When you use the document() function in XSLT 2.0 within XMLSpy or with our royalty-free XSLT engine AltovaXML, you can directly access files contained in a ZIP archive by using the "|zip" pipe operator within the filename. For example, "MyDocument.docx|zip\_rels\.rels" addresses the relationships file ".rels" in the archive directory "_rels" inside the ZIP package named "MyDocument.docx".
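To make this concrete, here is a hedged sketch that reads the main document part straight out of the package. The `package` parameter is a hypothetical name, the "|zip" addressing only works in the Altova engines mentioned above, and 'word\document.xml' is assumed to be the location of the main part, which is where Word 2007 normally stores it. A relative file name is assumed to resolve against the location of the stylesheet.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:docx="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    exclude-result-prefixes="docx">

  <xsl:output method="html" indent="yes"/>

  <!-- Hypothetical parameter: the .docx package to read -->
  <xsl:param name="package" select="'MyDocument.docx'"/>

  <xsl:template match="/">
    <!-- Altova-specific "|zip" syntax; the part path is an assumption -->
    <xsl:variable name="main"
        select="document(concat($package, '|zip\word\document.xml'))"/>
    <html>
      <body>
        <xsl:for-each select="$main//docx:p">
          <p>
            <!-- Text of all runs in the paragraph, joined without spaces -->
            <xsl:value-of select="docx:r/docx:t" separator=""/>
          </p>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

Because the content comes from document(), the XML file the transformation is started on only serves as a trigger here; any well-formed document will do.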

The benefits of using XSLT to reuse content from Open XML documents are obvious: because XSLT is a cornerstone of the core set of XML standards from the W3C, you can apply all your existing XML, XPath, and XSLT know-how and you can use the excellent tools support that is available for these standards. For example, you can easily develop and debug your XSLT stylesheet using the powerful XSLT debugger in XMLSpy, which allows you to single-step through the transformation, set breakpoints on XSLT instructions or even on data nodes in your Open XML document, view the partially generated output, and inspect the state of the XSLT processor in detail as the output document is constructed:

[Figure OOXML3: the XSLT debugger in XMLSpy]

Using the XSLT Debugger eliminates a lot of the pain that is normally associated with XSLT stylesheet development and allows for a very iterative approach to creating and improving stylesheets that facilitate content reuse and repurposing.

To sum it up, reusing content from Open XML documents for a variety of web applications, mobile scenarios, or social media and Web 2.0 contexts is very easy and can be achieved with standard XML-related technologies, such as XSLT.

For additional information on Open XML and how to take advantage of all the content that is now already available in that format, please refer to the following sites: