Parsing HTML -> XML and querying with XPath

Question

I want to parse an HTML page to get some data. First, I convert it to an XML document using SgmlReader. Then I load the result into an XmlDocument and navigate it with XPath:

//contains html document
var loadedFile = LoadWebPage();

...

Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;

sgmlReader.InputStream = new StringReader(loadedFile);

XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);

This code works fine in most cases, except on this site: www.arrow.com (try searching for something like OP295GS). I can get the results table using the following XPath:

var node = doc.SelectSingleNode(".//*[@id='results-table']");

This gives me a node with several child nodes:

[0]         {Element, Name="thead"}  
[1]         {Element, Name="tbody"}  
[2]         {Element, Name="tbody"}  
FirstChild   {Element, Name="thead"}

Ok, let's try to get some child nodes using XPath. But this doesn't work:

var childNodes = node.SelectNodes("tbody");
// childNodes.Count == 0

Neither does this:

var childNode = node.SelectSingleNode("thead");
// childNode = null

And even this:

var childNode = doc.SelectSingleNode(".//*[@id='results-table']/thead");
// childNode = null

What could be wrong with my XPath queries?

I've just tried parsing that HTML page with Html Agility Pack, and my XPath queries work there. But my application uses XmlDocument internally, so Html Agility Pack doesn't suit me.

I even tried the following trick with Html Agility Pack, but the XPath queries don't work there either:

//let's parse and convert HTML document using HTML Agility Pack and then load
//the result to XmlDocument
HtmlDocument xmlDocument = new HtmlDocument();
xmlDocument.OptionOutputAsXml = true;
xmlDocument.Load(new StringReader(webPage));

XmlDocument document = new XmlDocument();
document.LoadXml(xmlDocument.DocumentNode.InnerHtml);

Perhaps the web page contains errors (not all tags are closed, and so on). Even so, I can see the child nodes (through Quick Watch in Visual Studio) but cannot access them through XPath.

My XPath queries work correctly in Firefox with the FirePath and XPather plugins, but don't work in .NET XmlDocument :(

+1 for a good question, and for parsing HTML with the Agility Pack and XML parsers rather than regex. — Justin Morgan - On strike, Mar 19 '11 at 03:20
HTML Agility Pack is easy to use, but it has its own data types, which can be a problem when integrating it into existing logic. — mlurker, Mar 19 '11 at 03:28

score 1 · Answer 1 · answered Jul 02 '11 at 15:14

I have not used SgmlReader, but every time I have seen this problem it has been due to namespaces. A quick look at the HTML on www.arrow.com shows that this node has a namespace (note the xmlns:javaurlencoder):

<form name="CatSearchForm" method="post" action="http://components.arrow.com/part/search/OP295GS" xmlns:javaurlencoder="java.net.URLEncoder">

The code below is how I loop through all nodes in a document to see which ones have namespaces and which don't. If the node you are looking for, or any of its parents, has a namespace, you must create an XmlNamespaceManager and pass it along with your call to SelectNodes().

This is kind of annoying, so another idea might be to strip all the xmlns: attributes out of the XML before loading it into an XmlDocument. Then you won't need to fool with XmlNamespaceManager!

XmlDocument doc = new XmlDocument();
doc.Load(@"C:\temp\X.loadtest.xml");

Dictionary<string, string> namespaces = new Dictionary<string, string>();
XmlNodeList nlAllNodes = doc.SelectNodes("//*");
foreach (XmlNode n in nlAllNodes)
{
    if (n.NodeType != XmlNodeType.Element) continue;

    if (!String.IsNullOrEmpty(n.NamespaceURI) && !namespaces.ContainsKey(n.Name))
    {
        namespaces.Add(n.Name, n.NamespaceURI);
    }
}

// Inspect the namespaces dictionary to write the code below

XmlNamespaceManager nMgr = new XmlNamespaceManager(doc.NameTable);
// Sometimes this works
nMgr.AddNamespace("ns1", doc.DocumentElement.NamespaceURI); 
// You can make the first param whatever you want, it just must match in XPath queries
nMgr.AddNamespace("javaurlencoder", "java.net.URLEncoder"); 

XmlNodeList iter = doc.SelectNodes("//ns1:TestProfile", nMgr);
foreach (XmlNode n in iter)
{
    // Do stuff
}
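
To tie this back to the results-table queries in the question, here is a minimal sketch of two ways around the problem. It assumes the table's elements ended up in a default namespace; the XHTML namespace URI used here is only an assumption for illustration (inspect NamespaceURI on the real nodes to find the actual one), and the inline markup is a hypothetical stand-in for the converted page, not the real arrow.com HTML. The class and method names are made up for the example.

```csharp
using System;
using System.Xml;

public static class NamespacedXPathDemo
{
    // Assumed namespace, for illustration only.
    private const string AssumedNs = "http://www.w3.org/1999/xhtml";

    // Option 1: bind a prefix of your choice to the namespace and use it in the XPath.
    public static XmlNodeList SelectWithManager(XmlNode parent, string localName)
    {
        var nsMgr = new XmlNamespaceManager(new NameTable());
        nsMgr.AddNamespace("x", AssumedNs);
        return parent.SelectNodes("x:" + localName, nsMgr);
    }

    // Option 2: ignore namespaces entirely by comparing local-name(),
    // a standard XPath 1.0 function.
    public static XmlNodeList SelectByLocalName(XmlNode parent, string localName)
    {
        return parent.SelectNodes("*[local-name()='" + localName + "']");
    }

    public static void Main()
    {
        // Hypothetical sample document with a default namespace on the root.
        var doc = new XmlDocument();
        doc.LoadXml(
            "<html xmlns='" + AssumedNs + "'><body>" +
            "<table id='results-table'><thead/><tbody/><tbody/></table>" +
            "</body></html>");

        // The id attribute is not namespaced, so this query still finds the table.
        XmlNode table = doc.SelectSingleNode("//*[@id='results-table']");

        // An unprefixed name only matches elements in no namespace.
        Console.WriteLine(table.SelectNodes("tbody").Count);        // prints 0
        Console.WriteLine(SelectWithManager(table, "tbody").Count); // prints 2
        Console.WriteLine(SelectByLocalName(table, "thead").Count); // prints 1
    }
}
```

The local-name() form is more verbose in each query, but it keeps working if the namespace URI turns out to be different from the one you expected.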

score 0 · Answer 2 · answered Mar 19 '11 at 03:14


To be honest, when I try to get information from a website, I use regex. Kore Nordmann (on his PHP blog) argues that this is not a good idea, but some of the comments there say otherwise.

http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html

http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

It is about PHP, though, so sorry for that =) Hope it helps anyway.


Jakob Alexander Eichler


There are some very good reasons not to try to parse (X)HTML with regex. For one thing, it's literally impossible to do it correctly. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Justin Morgan - On strike Mar 19 '11 at 03:18
In JDownloader we use regex in all the decrypter plugins too. I also wrote a browser-game bot using regex, and admittedly that was the reason they were able to detect the bot after some years. They installed a gap, and because of the regex my bot did not notice that the page had changed. But that check could have been done with regex as well; I just forgot to build in an "HTML structure has not changed" mechanism to avoid being detected. – Jakob Alexander Eichler Mar 19 '11 at 03:20
The previous version of my app used regex internally. It was a nightmare (compared to XPath). – mlurker Mar 19 '11 at 03:27
