Using XPath to extract XOM elements from documents with unnecessary namespaces

Question

I'm trying to parse some HTML returned by an external system with XOM. The HTML looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<body>
  <div>
    Help I am trapped in a fortune cookie factory
  </div>
</body>
</html>

(Actually it's significantly messier, but it has this DOCTYPE declaration and these namespace and language declarations, and the HTML above exhibits the same problem as the real HTML.)

What I want to do is extract the content of the <div>, but the namespace declaration seems to be confusing XPath. If I strip out the namespace declaration (by hand, from the file), the following code finds the <div>, no problem:

Document document = ...
Nodes divs = document.query("//div");

But with the namespace, the returned Nodes has a size of 0.

All right, how about if I strip the namespace programmatically?

Element rootElement = document.getRootElement();
rootElement.removeNamespaceDeclaration(rootElement.getNamespacePrefix());

...looks like it should work, but does nothing. From the javadoc:

This method only removes additional namespaces added with addNamespaceDeclaration.

Okay, I thought, I'll provide the namespace to the query:

XPathContext context = 
    XPathContext.makeNamespaceContext(document.getRootElement());
Nodes divs = document.query("//div", context);

Size still zero.

How about constructing the namespace context by hand?

XPathContext context = new XPathContext(
     rootElement.getNamespacePrefix(), rootElement.getNamespaceURI());
Nodes divs = document.query("//div", context);

The XPathContext constructor blows up with:

nu.xom.NamespaceConflictException: 
    XPath expressions do not use the default namespace

So, I'm looking for either:

a way to make this query work, or
a way to programmatically strip the namespace declarations, or
an explanation of the correct approach, assuming both of these are wrong.

Update: Based on Lev Levitsky's answer and the Jaxen FAQ I came up with the following hack:

XPathContext context = new XPathContext(
    "foo", 
    document.getRootElement().getNamespaceURI());
Nodes divs = document.query("//foo:div", context);

This still seems a bit demented to me, but I guess it's the way Jaxen wants you to do things.

Update #2: As noted below and all over the Internet, this isn't Jaxen's fault; it's just XPath being XPath.

So, while this hack works, I would still like a way to strip the namespace declaration. Preferably without going as far as XSLT.

This is the way XPath works with namespaces, it does not depend on Jaxen: if you want to match something with a namespace you must use an explicit prefix in the XPath — MiMo, Mar 13 '12 at 01:21
Yes, on further reading I see that. So, okay, no blame attaches to Jaxen, but it still seems a bit demented. Or, at best, pedantic, and designed primarily for maximum correctness in unrealistic use cases. — David Moles, Mar 14 '12 at 23:17
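
The point made in these comments, that this is XPath behaviour rather than a Jaxen quirk, can be checked against a different engine. The following is a minimal sketch that swaps XOM for the JDK's built-in javax.xml.xpath so that it runs without extra jars. The class name DefaultNamespaceDemo, the helper count and the prefix "foo" are illustrative assumptions, and the DOCTYPE is left out of the sample document so that no DTD is fetched.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DefaultNamespaceDemo {
    public static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Same shape as the document in the question, without the DOCTYPE.
    public static final String XML =
        "<html xmlns=\"" + XHTML_NS + "\" xml:lang=\"en\" lang=\"en\">"
        + "<body><div>Help I am trapped in a fortune cookie factory</div></body></html>";

    // Binds the arbitrary prefix "foo" to the XHTML namespace (hypothetical prefix).
    static final NamespaceContext FOO_CONTEXT = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            return "foo".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "foo" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String uri) {
            String prefix = getPrefix(uri);
            List<String> prefixes = (prefix == null)
                ? Collections.<String>emptyList()
                : Collections.singletonList(prefix);
            return prefixes.iterator();
        }
    };

    // Parses the sample document and returns how many nodes the expression selects.
    public static int count(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this the namespace is ignored entirely
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(FOO_CONTEXT);
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//div"));     // 0: an unprefixed name means "no namespace"
        System.out.println(count("//foo:div")); // 1: the prefix is resolved to the XHTML namespace
    }
}
```

The prefix used in the expression does not have to match anything in the document; only the namespace URI it is bound to matters.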

score 2 · Answer 1 · answered Apr 02 '13 at 23:17

You can write:

Nodes divs = document.query("//*[local-name()='div' and namespace-uri()='http://www.w3.org/1999/xhtml']");
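
As a quick check that this expression needs no prefix binding at all, here is a small sketch that evaluates it with the JDK's javax.xml.xpath rather than XOM (a deliberate swap so that it runs with no extra dependencies). The class name LocalNameDemo and the method firstDivText are assumptions made for illustration, and the sample document omits the DOCTYPE.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LocalNameDemo {
    // Same shape as the document in the question, without the DOCTYPE.
    public static final String XML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"
        + "<body><div>Help I am trapped in a fortune cookie factory</div></body></html>";

    // Matches on local name and namespace URI inside a predicate, so no prefix is declared.
    public static final String EXPRESSION =
        "//*[local-name()='div' and namespace-uri()='http://www.w3.org/1999/xhtml']";

    // Returns the trimmed text of the first matching div, or null if nothing matches.
    public static String firstDivText() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace-uri() to see the namespace
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(EXPRESSION, doc, XPathConstants.NODESET);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstDivText());
    }
}
```

The trade-off is verbosity: every step of a longer path needs its own predicate, whereas a bound prefix keeps the expression short.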

peter.murray.rust

score 1 · Accepted Answer · answered Mar 12 '12 at 20:16

You should either specify the namespace directly with something like

Nodes divs = document.query("//{http://www.w3.org/1999/xhtml}div");

or use prefixes that are mapped to the respective namespaces (I guess that is what NamespaceContext is for, but there are no prefixes in your query).

Unfortunately, I don't know how it's implemented in Java, but I can provide a Python example if it helps.
