Thursday, September 12, 2013

Use XPath Expressions to Refine Data Selection

In this era of big data we often need to simplify analysis and communication by creating customized views into sections of a larger file. XPath, short for XML Path Language, is designed to allow users to address parts of large XML documents. XMLSpy supports XPath with an XPath Analyzer window and in its interface for charting XML data, MapForce supports XPath functions for XML data mapping, and StyleVision supports XPath in conditional templates, extension templates, and template filtering.

Altova tools implement the same user interface wherever possible, so skills developed in one tool are easily transferrable. We’ve blogged before about XPath autocalculate in StyleVision, so let’s look at a different example.

Very often a large dataset contains a number of individual tables, such as this example of information about employment of recent college graduates from the US Department of Education, displayed here as a CSV file:

CSV view of employment data for college graduates

Each table describes one employment statistic for graduates in different majors and years.

An XML Schema to describe the data might look like this in XMLSpy Schema View, where each individual table is described in the <set> element, and expanding the <row> element would display additional child elements for each college major or field of study.

XMLSpy Schema view

A simple MapForce data mapping from the CSV file to the XML Schema converts the data to XML format, which we can open in XMLSpy:

XMLSpy text view of a portion of the data

If we want to create a graph for one of the charts, it’s easiest to start from XMLSpy Grid view, where we can select one sub-table, then choose New Chart from the right-click context menu:

XMLSpy Grid View of the XML data

In the Chart Data Selection dialog that opens next, XMLSpy automatically generates an XPath expression describing the selected data:

XMLSpy Chart Data Selection dialog

The corresponding line chart looks like this:

XMLSpy chart of the data selected in Grid View

To graph a different table, we could go back into the data and manually select a different set, or we can simply edit the XPath expression. The original expression in the data selection window was: /data/set[3]/row

We can change the number 3 to select a different set, or we can take advantage of the setnum attribute to select a table by its contents, not its sequence in the file. This would be a better way to explicitly choose one table in case the data changes later. For instance, in a future iteration a table could be inserted in the second position for part-time employment, changing the sequence numbers of all the following tables.

An XPath expression that selects data by using the setnum attribute to display the chart for percent of grads employed in a field closely related to their studies looks like this:

Editing the XPath expression in the chart data selection dialog

The corresponding chart showing the percentage of employees working in jobs closely related to their fields of study looks like this:

XMLSpy line chart for graduates employed close to their fields of study

If you want to experiment with various options as you develop an XPath expression, the XPath Analyzer window is a handy tool. The XPath Analyzer checks syntax as you type and previews the result:

XMLSpy XPath Analyzer window

The next step is to take the charts and XPath data selection expressions developed in XMLSpy and use them in StyleVision to create a stylesheet for data reporting. We’ll cover exactly that topic in a future post.

In the meantime, you can click here to download a free trial of the Altova MissionKit, including XMLSpy, MapForce, and StyleVision and get started for yourself!

1 comment:

DaveMcG said...

A follow-up to this post titled XPath Expressions for Data Reporting explains how to apply the XPath Expression in an interactive data report.