6

Can anyone recommend a Java library that lets me run XPath queries over URLs? I've tried JAXP without success.

Thank you.

Leonardo Marques
  • See http://stackoverflow.com/questions/9022140/using-xpath-contains-against-html-in-java - not quite a duplicate as it asks about specific XPath functionality but there are better answers there. – Mark Butler Jan 07 '13 at 00:34
  • @Reonarudo I am in the same situation you were in when you asked this question. There are many possible suggestions/solutions in the answers, but I would like to know which solution (library) you used and whether it worked out the way you wanted. – Uther Pendragon Jun 20 '15 at 19:08
  • @UtherPendragon I'm sorry, but this was a long time ago and I cannot recall which project this was. Anyway, there should be newer/better libraries available nowadays. – Leonardo Marques Jun 23 '15 at 12:14

5 Answers

8

There are several different approaches to this documented on the Web:

Using HtmlCleaner

Using Jericho

I have tried a few different variations of these approaches, i.e. HtmlParser plus the Java DOM parser, and JSoup plus Jaxen, but the combination that worked best is HtmlCleaner plus the Java DOM parser. The next best combination was Jericho plus Jaxen.
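One plausible reason the cleaning step matters (and possibly why JAXP on its own did not work for the asker, though that is an assumption) is that the JDK's DOM parser only accepts well-formed XML, while real-world HTML often is not. A minimal sketch using only the JDK; the class and method names are made up for illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library mentioned in this thread.
public class StrictParserSketch {

    // Returns true only if the JDK's XML parser accepts the markup as well-formed.
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler.fatalError rethrows the exception, so the failure
            // surfaces as a SAXException below instead of default error reporting.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Malformed markup such as an unclosed <br> ends up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical HTML: <br> is never closed, so this is not well-formed XML.
        System.out.println(parsesAsXml("<p>one<br>two</p>"));
        // The same content written as well-formed XHTML.
        System.out.println(parsesAsXml("<p>one<br/>two</p>"));
    }
}
```

Libraries such as HtmlCleaner or Jericho sit in front of this step and turn the first kind of input into something a DOM or XPath engine can work with.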

Mark Butler
6

jsoup, a Java HTML parser. Its selector syntax is very similar to jQuery's.

Artem Barger
  • I'm not sure. It does much simpler queries than XPath-based ones. You can read the documentation; there are a lot of good examples explaining how to run those queries. – Artem Barger Jul 31 '10 at 08:17
  • 5
    jsoup (at least in version 1.7.3) doesn't support XPath. – brabec Jan 11 '14 at 20:30
  • jsoup uses CSS/jQuery-style selector syntax, which is similar to and better than XPath – phil Mar 01 '14 at 06:02
  • 17
    CSS Selectors are not better than XPath. There are some things which you can select in XPath but not CSS Selectors – Neil McGuigan Jul 19 '16 at 23:09
  • jsoup now supports xpath, as well as CSS selectors. Since September 2021 in [jsoup 1.14.3](https://jsoup.org/news/release-1.14.3). – Jonathan Hedley Oct 06 '22 at 11:47
1

You could use TagSoup together with Saxon. TagSoup exposes a SAX parser interface, so you simply substitute it for whatever XML SAX parser you would otherwise use, and Saxon's XPath 2.0, XSLT 2.0 or XQuery 1.0 implementation works as usual.

Martin Honnen
1

Use Xsoup. According to the docs, it's faster than HtmlCleaner. Example:

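    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.junit.Assert;   // JUnit 4 assumed, based on the Assert.assertEquals calls
    import org.junit.Test;
    import us.codecraft.xsoup.Xsoup;

    public class XsoupTest {   // class name is illustrative

        @Test
        public void testSelect() {
            String html = "<html><div><a href='https://github.com'>github.com</a></div>" +
                    "<table><tr><td>a</td><td>b</td></tr></table></html>";

            // Xsoup evaluates XPath against a jsoup Document
            Document document = Jsoup.parse(html);

            // get() returns the first match as a String
            String result = Xsoup.compile("//a/@href").evaluate(document).get();
            Assert.assertEquals("https://github.com", result);

            // list() returns every match
            List<String> list = Xsoup.compile("//tr/td/text()").evaluate(document).list();
            Assert.assertEquals("a", list.get(0));
            Assert.assertEquals("b", list.get(1));
        }
    }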

Link to Xsoup - https://github.com/code4craft/xsoup

bigbounty
0

I've used JTidy to make HTML into a proper DOM, then used plain XPath to query the DOM.

If you want to do cross-document/cross-URL queries, it's better to use JTidy with XQuery.
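For the "plain XPath over a DOM" half of this approach, a minimal sketch with the JDK's `javax.xml.xpath` API might look like the following. It deliberately skips the JTidy step: `parseWellFormed` is a hypothetical stand-in that only handles markup that is already well-formed, and in practice the `Document` would come from JTidy (or another cleaner) instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative only.
public class DomXPathSketch {

    // Stand-in for the tidying step: works only on markup that is ALREADY well-formed.
    static Document parseWellFormed(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    // Evaluates an XPath expression and returns the value of each matched node
    // (attribute value for attribute nodes, text content for text nodes).
    static List<String> selectStrings(Document dom, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, dom, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div><a href='https://github.com'>github.com</a></div>"
                + "<table><tr><td>a</td><td>b</td></tr></table></body></html>";
        Document dom = parseWellFormed(xhtml);
        System.out.println(selectStrings(dom, "//a/@href"));      // link targets
        System.out.println(selectStrings(dom, "//tr/td/text()")); // table cell text
    }
}
```

Once the HTML has been turned into a `org.w3c.dom.Document`, the querying code does not need to know which library produced it.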

Tassos Bassoukos