Tuesday, July 14, 2009

XPath Struggles

Warning: This is a note about coding in C#. If that kind of thing is boring to you, don't read any more. It's also a little long.
This information was difficult to find, so I figured I would post it in case anyone else needed it in the future, including myself.


Given an XML file like the following, I needed to write a parser (in C#) that would verify the existence of certain tags and store them in a class.

<feed xmlns="http://www.w3.org/2005/Atom" lang="en-US">
<entry>
<updated>
2008-09-05T17:05:01-07:00
</updated>
<link rel="self" href="http://www.google.com" />
<link rel="alternate" href="http://www.netflix.com" />
<link rel="assetlinklogo" href="http://www.google.com%5Cimagefile.gif" />
<link rel="assetlinkdemo" href="http://www.netflix.com%5Cdemo.aspx" />
<link rel="assetlinkscreenshot1" href="http://www.google.com%5Cscreenshot1.gif" />
<link rel="assetlinkscreenshot2" href="http://www.netflix.com%5Cscreenshot2.gif" />
<averagerating xmlns="http://testserver:8080/syndicate.xsd">
One
</averagerating>
</entry>
</feed>


There are 2 major problems.

The XML file resides in the "http://www.w3.org/2005/Atom" namespace, but specific tags reside in a different namespace: "http://testserver:8080/syndicate.xsd". Dealing with one namespace was annoying enough; how do we juggle two?

Also, the assetlink tags are optional: there may be one hundred or none. I wanted to put them in a list, but exclude the self and alternate links, which are required and will always show up.

To store the data I used a struct like this:

public struct BasicData
{
public XmlNode feed;
public XmlNode updated;
public XmlNode selfLink;
public XmlNode altLink;
public XmlNodeList assetLinks;
public XmlNode AverageRating;
};


I was used to reading XML files with XmlDocument (in the System.Xml namespace):

BasicData basicData = new BasicData();
XmlDocument xmlFile = new XmlDocument();
XmlNode root;
xmlFile.Load(filePath);
root = xmlFile.DocumentElement;


"root" now points to the base node of the XML document, which in this example is the <feed> node. However, a plain XPath query won't reach the children here; only something like root.FirstChild and NextSibling will. The normal XPath query would look something like this:

basicData.updated = xmlFile.SelectSingleNode(@"/feed/entry/updated");

This call returns null because all the tags reside in a namespace, and an un-prefixed name in an XPath query only matches elements that have no namespace. To deal with this we need to create a namespace manager and give our namespace a prefix of our own choosing (sometimes XML namespaces are declared with a prefix, but not in this case). The namespace manager is set up as follows:

XmlNamespaceManager nsmgr;
nsmgr = new XmlNamespaceManager(xmlFile.NameTable);
nsmgr.AddNamespace("atom", @"http://www.w3.org/2005/Atom");
nsmgr.AddNamespace("ppt", @"http://testserver:8080/syndicate.xsd");


Remember I pointed out the file has 2 namespaces? We need to add both of them to the namespace manager since we will need both. Now that we have a namespace manager and the namespaces added, we can access the updated node and the self link node as follows:

basicData.updated = xmlFile.SelectSingleNode(@"/atom:feed/atom:entry/atom:updated", nsmgr);
basicData.selfLink = xmlFile.SelectSingleNode(@"/atom:feed/atom:entry/atom:link[@rel='self']", nsmgr);
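To see the difference for yourself, here is a minimal sketch that runs the un-prefixed query and the prefixed one against a trimmed-down copy of the feed. The class and method names are made up for illustration, and the XML is loaded from a string rather than a file so the snippet stands on its own.

```csharp
using System;
using System.Xml;

public static class NamespaceLookupSketch
{
    // Trimmed-down copy of the feed: just enough for the lookup.
    public const string SampleXml =
        "<feed xmlns=\"http://www.w3.org/2005/Atom\">" +
        "<entry><updated>2008-09-05T17:05:01-07:00</updated>" +
        "<link rel=\"self\" href=\"http://www.google.com\" /></entry></feed>";

    public static XmlDocument Load()
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        return doc;
    }

    // Un-prefixed names only match elements in no namespace,
    // so this comes back null for the Atom feed.
    public static XmlNode FindWithoutPrefix(XmlDocument doc)
    {
        return doc.SelectSingleNode("/feed/entry/updated");
    }

    // The "atom" prefix is our own choice; only the URI has to match the file.
    public static XmlNode FindWithPrefix(XmlDocument doc)
    {
        var nsmgr = new XmlNamespaceManager(doc.NameTable);
        nsmgr.AddNamespace("atom", "http://www.w3.org/2005/Atom");
        return doc.SelectSingleNode("/atom:feed/atom:entry/atom:updated", nsmgr);
    }
}
```

Calling FindWithoutPrefix on the loaded document should give null, while FindWithPrefix should give the updated node.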

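And to access the node in the second namespace:

basicData.AverageRating = xmlFile.SelectSingleNode(@"/atom:feed/atom:entry/ppt:averagerating", nsmgr);

Both prefixes can appear in a single XPath expression; each step uses the prefix of the namespace its element lives in. Element names are case-sensitive, so the name in the query has to match the lowercase averagerating tag in the file.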
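As a rough sketch of mixing the two prefixes in one path, the snippet below pulls the rating text out of a cut-down feed. The helper name ReadAverageRating is hypothetical, and trimming the text is my own addition because the sample puts the value on its own line.

```csharp
using System;
using System.Xml;

public static class MixedNamespaceSketch
{
    // Cut-down feed containing only the element from the second namespace.
    public const string SampleXml =
        "<feed xmlns=\"http://www.w3.org/2005/Atom\"><entry>" +
        "<averagerating xmlns=\"http://testserver:8080/syndicate.xsd\">\nOne\n</averagerating>" +
        "</entry></feed>";

    // Returns the trimmed rating text, or null when the element is absent.
    public static string ReadAverageRating(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var nsmgr = new XmlNamespaceManager(doc.NameTable);
        nsmgr.AddNamespace("atom", "http://www.w3.org/2005/Atom");
        nsmgr.AddNamespace("ppt", "http://testserver:8080/syndicate.xsd");

        // atom: steps for the feed and entry, ppt: for the rating element.
        XmlNode node = doc.SelectSingleNode(
            "/atom:feed/atom:entry/ppt:averagerating", nsmgr);
        return node == null ? null : node.InnerText.Trim();
    }
}
```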

Now, to get the asset links but not the other links, XPath has a "starts-with" function. In this case you would use it as follows:

basicData.assetLinks = xmlFile.SelectNodes(@"atom:feed/atom:entry/atom:link[starts-with(@rel, 'assetlink')]", nsmgr);
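Here is a small sketch of what you might do with that node list afterwards, assuming you only want the href values. The AssetLinkHrefs helper is a made-up name. Note that rel and href need no prefix, since un-prefixed attributes are not in any namespace.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class AssetLinkSketch
{
    // Shortened feed: two required links plus two optional asset links.
    public const string SampleXml =
        "<feed xmlns=\"http://www.w3.org/2005/Atom\"><entry>" +
        "<link rel=\"self\" href=\"http://www.google.com\" />" +
        "<link rel=\"alternate\" href=\"http://www.netflix.com\" />" +
        "<link rel=\"assetlinklogo\" href=\"http://www.google.com%5Cimagefile.gif\" />" +
        "<link rel=\"assetlinkdemo\" href=\"http://www.netflix.com%5Cdemo.aspx\" />" +
        "</entry></feed>";

    // Collects the href of every link whose rel starts with "assetlink".
    public static List<string> AssetLinkHrefs(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var nsmgr = new XmlNamespaceManager(doc.NameTable);
        nsmgr.AddNamespace("atom", "http://www.w3.org/2005/Atom");

        XmlNodeList links = doc.SelectNodes(
            "/atom:feed/atom:entry/atom:link[starts-with(@rel, 'assetlink')]", nsmgr);

        var hrefs = new List<string>();
        foreach (XmlNode link in links)
        {
            XmlAttribute href = link.Attributes["href"];
            if (href != null)
            {
                hrefs.Add(href.Value);
            }
        }
        return hrefs;
    }
}
```

When no asset links are present the list simply comes back empty, which fits the "one hundred or none" case.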

And there you go! The full code for this will look like this:

public struct BasicData
{
public XmlNode feed;
public XmlNode updated;
public XmlNode selfLink;
public XmlNode altLink;
public XmlNodeList assetLinks;
public XmlNode AverageRating;
};

BasicData basicData = new BasicData();
XmlDocument xmlFile = new XmlDocument();
XmlNode root;
xmlFile.Load(filePath);
XmlNamespaceManager nsmgr;
nsmgr = new XmlNamespaceManager(xmlFile.NameTable);
nsmgr.AddNamespace("atom", @"http://www.w3.org/2005/Atom");
nsmgr.AddNamespace("ppt", @"http://testserver:8080/syndicate.xsd");

basicData.updated = xmlFile.SelectSingleNode(@"/atom:feed/atom:entry/atom:updated", nsmgr);
basicData.selfLink = xmlFile.SelectSingleNode(@"/atom:feed/atom:entry/atom:link[@rel='self']", nsmgr);
basicData.assetLinks = xmlFile.SelectNodes(@"atom:feed/atom:entry/atom:link[starts-with(@rel, 'assetlink')]", nsmgr);

...(and so on)...
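Since the original goal was to verify that tags exist, here is one possible sketch of that check: run each query and record the ones that come back null. The class name, the labels, and the choice of which tags count as required (I assumed updated and the rating, on top of the self and alternate links) are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class FeedCheckSketch
{
    // Label / XPath pairs for the tags treated as required in this sketch.
    private static readonly string[][] Required =
    {
        new[] { "updated",       "/atom:feed/atom:entry/atom:updated" },
        new[] { "selfLink",      "/atom:feed/atom:entry/atom:link[@rel='self']" },
        new[] { "altLink",       "/atom:feed/atom:entry/atom:link[@rel='alternate']" },
        new[] { "averageRating", "/atom:feed/atom:entry/ppt:averagerating" },
    };

    // Returns the labels of required tags that the XML does not contain.
    public static List<string> MissingTags(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var nsmgr = new XmlNamespaceManager(doc.NameTable);
        nsmgr.AddNamespace("atom", "http://www.w3.org/2005/Atom");
        nsmgr.AddNamespace("ppt", "http://testserver:8080/syndicate.xsd");

        var missing = new List<string>();
        foreach (string[] pair in Required)
        {
            // SelectSingleNode returns null when nothing matches the path.
            if (doc.SelectSingleNode(pair[1], nsmgr) == null)
            {
                missing.Add(pair[0]);
            }
        }
        return missing;
    }
}
```

An empty result means every required tag was found; otherwise the list names what is missing.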

1 comment:

Noël said...

Goodness, on facebook and now here too? You must be really trying to get the word out.