I need to pass some XML that is not strictly well-formed through an XPath evaluator. The XML is in fact mostly HTML, which could look like the following:

<p>
  <a href="http://www.something.com/5993810749/" title="IMG_3013”>
    <img src="5993810749_107ea7d465_m.jpg" width="240" height="160" alt="IMG_3013”/>
  </a>
</p>
<p>
  <a href="http://www.something.com/836492365986/" title="IMG_3018”>
    <img src=“8364923659_107ea3286465_m.jpg" width=“365" height=“248" alt="IMG_3018”/>
  </a>
</p>

The noticeable problems are that it has no root element and that <img> is not terminated. Wrapping it in a root element is easy, but when I pass the result through the XPath evaluator, I get an exception like:

[Fatal Error] :7:196: The element type "img" must be terminated by the matching end-tag "</img>".

Btw, the code for the XPath Evaluator in Java looks like:

import java.io.StringReader;
import javax.xml.xpath.*;
import org.xml.sax.InputSource;
XPath xPath = XPathFactory.newInstance().newXPath();
Object result = xPath.evaluate(xpath,
    new InputSource(new StringReader(xmlString)), XPathConstants.NODESET);

So, I would like to know: what is the best way to deal with this, so that I can successfully evaluate the XML? It seems I have at least two options: (a) get the XPath evaluator to be smarter; or (b) find a way to automatically repair the poorly formed XML. A solution to this problem would be appreciated!

Larry
  • If the XML is not well-formed it will not parse. If it won't parse, you can't query it - XPath or otherwise. – Oded Jan 21 '13 at 14:06
  • You can find suitable library here: http://stackoverflow.com/questions/3361263/library-to-query-html-with-xpath-in-java – hoaz Jan 21 '13 at 14:11
  • Ok, so if poorly formatted XML won’t work, at least is there a way to repair the text so that it can parse? – Larry Jan 21 '13 at 14:13
  • What matters is getting a usable DOM tree. There are HTML parsers such as NekoHTML that can parse non-XML HTML documents and produce a suitable DOM that you can then run XPath queries over. One thing to note if you do use Neko is that the element names in the DOM tree will be upper case, so you'll have to use XPaths like `//P/A/IMG` instead of `//p/a/img` – Ian Roberts Jan 21 '13 at 14:14
  • By the way, it's "well-/poorly-formed XML", not "well-/poorly-formatted XML". Two completely different things. – BoltClock Jan 21 '13 at 14:16
  • BoltClock: Ok thanks, it has been changed! – Larry Jan 21 '13 at 14:17
  • This particular snippet looks well-formed except for some invalid quote characters in each `<img>` and the lack of a root element. If those were the only issues, then a simple character replacement and wrapping it in a root node would fix that up. – JLRishe Jan 21 '13 at 14:28
  • JLRishe: Then not sure why I keep getting the fatal error that `<img>` is not terminated. And the incorrect quotes are mainly just copy-pasting to stack-exchange editor. – Larry Jan 21 '13 at 14:34
  • @Larry I don't know. The `<img>`s in your example above are terminated, or would be if it weren't for the invalid quote symbols. And how is copy-pasting producing invalid quote symbols? – JLRishe Jan 21 '13 at 15:00
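
The repair described in the comments (replace the invalid quote characters, then wrap the fragment in a root element) could be sketched with the JDK alone, roughly as follows. The class and method names (`FragmentXPath`, `repair`, `evaluate`) and the `<root>` wrapper are made up for illustration. The sketch assumes the only defects are typographic double quotes and a missing root element; an HTML-style `<img ...>` without the closing slash would still be rejected, and a typographic quote that legitimately appears inside text or an attribute value would be altered too.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class FragmentXPath {
    /** Replaces typographic double quotes with ASCII quotes and wraps the fragment in one root element. */
    static String repair(String fragment) {
        String straightQuotes = fragment.replace('\u201C', '"').replace('\u201D', '"');
        return "<root>" + straightQuotes + "</root>";
    }

    /** Repairs the fragment, then evaluates the XPath expression against it as a node set. */
    static NodeList evaluate(String expression, String fragment) throws XPathExpressionException {
        XPath xPath = XPathFactory.newInstance().newXPath();
        return (NodeList) xPath.evaluate(expression,
                new InputSource(new StringReader(repair(fragment))), XPathConstants.NODESET);
    }

    public static void main(String[] args) throws XPathExpressionException {
        // A shortened version of the snippet from the question, including the bad closing quotes.
        String fragment = "<p><a href=\"http://www.something.com/5993810749/\" title=\"IMG_3013\u201D>"
                + "<img src=\"5993810749_107ea7d465_m.jpg\" alt=\"IMG_3013\u201D/></a></p>";
        NodeList sources = evaluate("//img/@src", fragment);
        for (int i = 0; i < sources.getLength(); i++) {
            System.out.println(sources.item(i).getNodeValue());
        }
    }
}
```

This only makes sense when the input is known to be nearly well-formed; for arbitrary HTML, a tolerant parser is the safer route.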

2 Answers

You could parse the HTML using an HTML parser such as NekoHTML, then run XPath queries over the resulting DOM tree:

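import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Let NekoHTML build a DOM tree from the non-well-formed input
DOMParser parser = new DOMParser();
parser.parse(new InputSource(new StringReader(xmlString)));
// Evaluate the XPath against the parsed document instead of the raw text
XPath xPath = XPathFactory.newInstance().newXPath();
Object result = xPath.evaluate(xpath, parser.getDocument(),
      XPathConstants.NODESET);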

Note that NekoHTML produces the specific HTML DOM nodes by default, and these report their node names in upper case regardless of the case of the original input tags. Therefore if you want an XPath that will extract all <p> elements then you need //P rather than //p.
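
If you would rather not hard-code the case into every expression, one possible workaround is to match element names case-insensitively with the XPath 1.0 `translate()` function. The following is only a sketch using the JDK's own parser on an upper-case document as a stand-in for a NekoHTML DOM; `AnyCaseXPath`, `step`, `select`, and `parse` are hypothetical names, and `step` assumes a plain ASCII element name with no quote characters in it.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class AnyCaseXPath {
    private static final String UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final String LOWER = "abcdefghijklmnopqrstuvwxyz";

    /** Builds an XPath 1.0 step that matches elements by local name, ignoring ASCII case. */
    static String step(String elementName) {
        return "*[translate(local-name(), '" + UPPER + "', '" + LOWER + "') = '"
                + elementName.toLowerCase(Locale.ROOT) + "']";
    }

    /** Evaluates the expression against an already parsed document. */
    static NodeList select(Document document, String expression) throws XPathExpressionException {
        return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, document, XPathConstants.NODESET);
    }

    /** Parses well-formed XML with the JDK parser; stands in here for the HTML parser's output. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Same expression works whether the DOM reports P/A/IMG or p/a/img.
        String expression = "//" + step("p") + "/" + step("a") + "/" + step("img");
        Document upper = parse("<P><A><IMG SRC=\"x.jpg\"/></A></P>");
        System.out.println(select(upper, expression).getLength());
    }
}
```

The expressions get longer and are likely slower than a plain `//P/A/IMG`, so this is mainly useful when the same query has to run against DOMs from different parsers.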

Ian Roberts
There are several utilities that will convert HTML or generally anything with angle-brackets into well-formed XML (which might or might not be the XML that you expected, but it will be well-formed). JTidy and TagSoup are often used in this role. You don't have to materialize the XML, you can pump it straight into the next step in your processing pipeline, e.g. an XSLT transformation or schema validation.
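
As a rough illustration of "pumping it straight into the next step": any parser that implements the standard `org.xml.sax.XMLReader` interface can feed a JAXP identity transform, which then builds a DOM (or drives an XSLT stylesheet) without the repaired XML ever being written out as text. TagSoup's `org.ccil.cowan.tagsoup.Parser` is, as far as I know, such a reader. The sketch below deliberately uses the JDK's strict parser as a stand-in so that it runs without extra libraries; `SaxPipeline` and `toDom` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

final class SaxPipeline {
    /** Streams the reader's SAX events into a DOM tree through an identity transform. */
    static Node toDom(XMLReader reader, String markup) throws TransformerException {
        DOMResult result = new DOMResult();
        TransformerFactory.newInstance().newTransformer()
                .transform(new SAXSource(reader, new InputSource(new StringReader(markup))), result);
        return result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader: the JDK's strict XML parser. A tidying parser that
        // implements XMLReader would be passed to toDom() in its place.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        Node dom = toDom(reader, "<root><p><img src=\"a.jpg\"/></p></root>");
        Object count = XPathFactory.newInstance().newXPath()
                .evaluate("count(//img)", dom, XPathConstants.NUMBER);
        System.out.println(count);
    }
}
```

With a tidying reader, check which element names and namespaces it reports before writing the XPath, since they may differ from the original markup.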

Michael Kay