5 ways to make XML more XPath friendly

As Java developers, we should do what we can to optimize the xml we send out from our side of the fence. I mean, it’s our job to design and build awesome, elegant and efficient software whenever possible, right? We have our data and we want to serialize it into xml, so how can we make that xml as efficient and xpath friendly as possible?

1) Keep the structure simple

Consolidate data nodes whenever possible before marshaling your object into xml. You really don’t want to resort to xpath for lookups across nodes that could have been resolved earlier. Keep those lookups confined to the persistence layer where they belong. I realize this may not always be possible, but keeping them to a minimum really helps the xsl developers, who won’t be forced to write ridiculous xpath expressions just to link separate data nodes. We want to keep the xpath logic as simple as possible.

Consider the following xml document:

<root>
	<house id="1" color="white"/>
	<room name="livingRoom" houseId="1" windows="3"/>
	<room name="kitchen" houseId="1" windows="2"/>
	<room name="bedroom" houseId="1" windows="4"/>
	<room name="bathroom" houseId="1" windows="1"/>
</root>

If this is how your xml is structured and you want to transform the data with xsl, you are creating all kinds of extra work and could end up causing an unnecessary performance bottleneck for your application. Here the rooms are siblings of the house and only point back to it through houseId. If you wanted to do something like lay out the total number of rooms for the house with an id of 1, your xpath would have to look something like this:

count(/root/room[@houseId=1])
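
From the Java side, that filtered count can be tried out with the standard javax.xml.xpath API. This is only a rough sketch: it assumes the flat layout in which house is an empty element and each room refers to it through houseId, and the class and method names (FlatRoomCounter, countRooms) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper; the class and method names are illustrative only.
class FlatRoomCounter {

    // Assumed flat layout: rooms are siblings of the house and link to it via houseId.
    static final String FLAT_XML =
        "<root>"
      + "<house id=\"1\" color=\"white\"/>"
      + "<room name=\"livingRoom\" houseId=\"1\" windows=\"3\"/>"
      + "<room name=\"kitchen\" houseId=\"1\" windows=\"2\"/>"
      + "<room name=\"bedroom\" houseId=\"1\" windows=\"4\"/>"
      + "<room name=\"bathroom\" houseId=\"1\" windows=\"1\"/>"
      + "</root>";

    // Counts the rooms of one house by filtering every room node under root.
    static int countRooms(String xml, int houseId) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expression = "count(/root/room[@houseId=" + houseId + "])";
        try {
            Double count = (Double) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
            return count.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countRooms(FLAT_XML, 1)); // prints 4
    }
}
```

The expression has to visit every room under the root each time it runs, whichever house you ask about.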

You are now making your xsl developer implement logic to count all the rooms for a particular house, using an xpath function and a predicate to filter a global count. Just because you can use the count function does not mean you should use it every chance you get. This xpath expression walks every room node under the root, keeps the ones whose houseId is 1, and returns how many are left. It’s not much of a problem if your xml holds one or two houses, but what if you had 10, 20 or even 30 houses or more? If you are processing numbers like these and you repeat the traversal across, say, a hundred requests, you would be doing something like 3,000 traversals. What if instead you used an xml structure like this:

<root>
	<house id="1"  totalRooms="4" color="white">
		<room name="livingRoom" windows="3"/>
		<room name="kitchen" windows="2"/>
		<room name="bedroom" windows="4"/>
		<room name="bathroom" windows="1"/>
	</house>
</root>

In this example we attached the room count to the house node as an attribute. This way, our xpath expression ends up looking like this:

/root/house/@totalRooms

No count function and no predicate to filter a global count; you end up with a simple, basic xpath expression. (With more than one house in the document you would still pick the house by id, but the count itself is already there.) You’re doing a single lookup instead of a node traversal that has to collect and calculate a count while the transformation is processing. Let the data/persistence layer perform those types of counts and populate all the data, and let your xsl/xpath lay out your data. Keep the structure simple. If this is not possible, you might be doing something wrong.
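
Here is one way the Java side might precompute the count before the xml ever leaves the application. It is a minimal sketch using the JDK’s DOM API; HouseXmlBuilder and its methods are hypothetical names, and a real application would more likely set the value in whatever marshaling layer it already has.

```java
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical serializer; the class and method names are illustrative only.
class HouseXmlBuilder {

    // Builds the nested document and stores the precomputed room count on <house>.
    static Document build(int houseId, String color, List<String> roomNames) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);

            Element house = doc.createElement("house");
            house.setAttribute("id", String.valueOf(houseId));
            // The count is computed once here, on the Java side.
            house.setAttribute("totalRooms", String.valueOf(roomNames.size()));
            house.setAttribute("color", color);
            root.appendChild(house);

            for (String name : roomNames) {
                Element room = doc.createElement("room");
                room.setAttribute("name", name);
                house.appendChild(room);
            }
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create DOM document", e);
        }
    }

    // Reads the count back with a plain attribute lookup, no count() needed.
    static String totalRooms(Document doc) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate("/root/house/@totalRooms", doc);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = build(1, "white", Arrays.asList("livingRoom", "kitchen", "bedroom", "bathroom"));
        System.out.println(totalRooms(doc)); // prints 4
    }
}
```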

2) Avoid namespaces whenever possible

Namespaces are important, yes. But if you are trying to express a Java object as xml, prefer a different, more specific name over attaching a namespace. This really comes down to creating a more descriptive naming scheme for your application’s objects. If you are using namespaces just for the heck of it, I’d urge you not to. You end up adding lots of unnecessary noise to your xpath, and having those colons all over the place in your xml can make the document look like it’s been processed through a cheese grater. Anyone trying to read your xml will cringe at the thought and will want to pass it off to the new guy as part of the hazing ritual. Consider the following xml:

<root xmlns:myHouseNamespace="http://example.com/house"> <!-- placeholder namespace URI -->
	<myHouseNamespace:house id="1"  totalRooms="4" color="white">
		<myHouseNamespace:room name="livingRoom" windows="3">
			<myHouseNamespace:furniture name="sofa" type="leather"/>
			<myHouseNamespace:furniture name="table"/>
			<myHouseNamespace:furniture name="lamp"/>
		</myHouseNamespace:room>
		<myHouseNamespace:room name="kitchen" windows="2"/>
		<myHouseNamespace:room name="bedroom" windows="4"/>
		<myHouseNamespace:room name="bathroom" windows="1"/>
	</myHouseNamespace:house>
</root>

This makes your xpath look like this:

/root/myHouseNamespace:house/myHouseNamespace:room[@name='livingRoom']/@windows

I don’t know about you, but I find this xpath expression hard to read with all the “myHouseNamespace:” junk all over the place. And we only went two levels deep into the tree. A third level down would have marched the xpath expression across the width of this blog! Who loves mile-long xpath expressions that add side scrollbars to your text editor on your widescreen monitor? NO ONE.
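
The noise doesn’t stop at the expression itself. In Java, a prefixed xpath only resolves if the parser is namespace aware and the XPath object has been given a NamespaceContext that maps the prefix to a URI. The sketch below shows that extra plumbing with the JDK’s javax.xml APIs; the namespace URI http://example.com/house is a placeholder I made up so the prefix has something to bind to, and NamespacedLookup is a hypothetical class name.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; the names and the namespace URI are assumptions for illustration.
class NamespacedLookup {

    static final String PREFIX = "myHouseNamespace";
    static final String HOUSE_NS = "http://example.com/house"; // placeholder URI

    static final String XML =
        "<root xmlns:" + PREFIX + "=\"" + HOUSE_NS + "\">"
      + "<" + PREFIX + ":house id=\"1\" totalRooms=\"4\" color=\"white\">"
      + "<" + PREFIX + ":room name=\"livingRoom\" windows=\"3\"/>"
      + "<" + PREFIX + ":room name=\"kitchen\" windows=\"2\"/>"
      + "</" + PREFIX + ":house>"
      + "</root>";

    // Looks up the living room's window count; note how much setup the prefix requires.
    static String livingRoomWindows(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // namespace awareness is off by default
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return PREFIX.equals(prefix) ? HOUSE_NS : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return HOUSE_NS.equals(uri) ? PREFIX : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    String prefix = getPrefix(uri);
                    return prefix == null
                        ? Collections.<String>emptyIterator()
                        : Collections.singletonList(prefix).iterator();
                }
            });
            return xpath.evaluate(
                "/root/myHouseNamespace:house/myHouseNamespace:room[@name='livingRoom']/@windows", doc);
        } catch (Exception e) {
            throw new IllegalStateException("Namespaced lookup failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(livingRoomWindows(XML)); // prints 3
    }
}
```

Without the namespace, the same lookup needs neither the factory flag nor the NamespaceContext.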

3) Use attributes wherever appropriate

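For a simple, single value there is really no practical difference between using an attribute and a child node that holds only text. (Attributes can’t repeat or contain child nodes, so this only applies to plain values.) In other words, for our purposes there is nothing different between this: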

<root>
	<house>
		<id>1</id>
		<totalRooms>4</totalRooms>
		<color>white</color>
	</house>
</root>

and this:

<root>
	<house id="1"  totalRooms="4" color="white"/>
</root>

So why prefer attributes? Because they make your xml document legible. Having a ton of single-child text nodes adds a lot of noise to your document, especially if their values are just digits or single words. It becomes easy to mix up your text values with the real, slightly more complex nodes of data in your xml tree. Readability is important, and keeping your xml as efficiently expressed as possible without sacrificing readability is paramount to making it understandable and easier to work with.
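
To see that the two forms give you the same value from the xpath side, here is a small sketch that runs one lookup against each document. AttributeVsElement and lookup are hypothetical names; only the javax.xml.xpath calls are part of the JDK.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper; the class and method names are illustrative only.
class AttributeVsElement {

    // The same house expressed with child text nodes...
    static final String ELEMENT_STYLE =
        "<root><house><id>1</id><totalRooms>4</totalRooms><color>white</color></house></root>";

    // ...and with attributes.
    static final String ATTRIBUTE_STYLE =
        "<root><house id=\"1\" totalRooms=\"4\" color=\"white\"/></root>";

    // Evaluates an xpath expression against an xml string and returns the string value.
    static String lookup(String xml, String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lookup(ELEMENT_STYLE, "/root/house/color"));    // prints white
        System.out.println(lookup(ATTRIBUTE_STYLE, "/root/house/@color")); // prints white
    }
}
```

The only change in the expression is the @ in front of the name.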

4) Clearly name your xml nodes

Use names that actually describe the data being represented. Stay away from these kinds of names:

<root>
	<h id="1"  totalRooms="4" color="white">
		<r name="livingRoom" windows="3">
			<f name="sofa" type="leather"/>
			<f name="table"/>
			<f name="lamp"/>
		</r>
		<r name="kitchen" windows="2"/>
		<r name="bedroom" windows="4"/>
		<r name="bathroom" windows="1"/>
	</h>
</root>

Sure, your xpath will be much shorter:

/root/h/r[@name='livingRoom']/f[@name='sofa']/@type

But unless you’re very, very comfortable with the single-letter naming convention, you might end up having a hard time keeping track of all the nodes, since they’re small and easy to overlook. Descriptive, concise names make your xml easier to learn and easier to come back to, because the data is named in a clear, self-explanatory way. Ideally, your xml should use names that even an untrained eye can pick up, so the basic structure makes sense with little preparation.

5) Make your xml as human readable as possible

This point encapsulates all the others. XML is meant to be tied very closely to data, and our ability to understand that data is what lets us as developers mold it into whatever we want it to be. If you’ve ever had to work through a very complex piece of xml, you’ll know that forcing anyone to muck through an unconventional structure and non-intuitive names ends up breaking the xslt development into three phases:

1) figuring out/understanding the xml
2) implementing the xsl
3) figuring out/understanding the xml, and what was just implemented

The longer the developer has to sit there and wrestle with #1 and #3, the more time is lost in bringing your product to the next release. We want to spend as little time as possible figuring out or compensating for poorly structured xml, so the real implementation work can be completed and we can move on to the next thing.

In other words

The bottom line is that if you structure your xml so that it’s easy for humans to understand, and you don’t cut corners by passing data lookups or counts on to the xsl developer, your application will become more efficient, better written, and easier to work with. This is a good place to exercise the separation of concerns principle: let the data/persistence layer do what it does best, and let the xml/xslt layer do what it does best. XSLT is a relatively expensive process, but even with infinite computing resources we should always strive to make the most of whatever resources we can allocate.
