
I want to extract content from an HTML document using XPath in Java. In Ruby I can do this with Nokogiri, as shown here:

require 'nokogiri'

xpath = '/html/body/div/div[2]/div[2]/div/div[2]/div[3]/p'
doc = Nokogiri::HTML(open('test_001_html64.html'))
doc.xpath(xpath).each do |link|
  puts link.content
end

I want to do it in pure Java. I looked at jsoup, but I couldn't find any documentation or example that uses an XPath expression to do this. Can someone suggest a way?

Thanks

Mir
  • Many related / duplicate questions – see http://stackoverflow.com/questions/9022140/using-xpath-contains-against-html-in-java, http://stackoverflow.com/questions/3352594/querying-an-html-page-with-xpath-in-java, and http://stackoverflow.com/questions/3361263/library-to-query-html-with-xpath-in-java – Mark Butler Jan 07 '13 at 00:43

3 Answers


You can use HtmlUnit for that task.

Here's a simple example:

import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

final WebClient webClient = new WebClient();
final HtmlPage page = webClient.getPage("http://www.google.com/");
// select nodes with the XPath expression from the question
List<DomNode> nodes = page.getByXPath("/html/body/div/div[2]/div[2]/div/div[2]/div[3]/p");
for (DomNode node : nodes) {
    System.out.println(node.getNodeName());
}
bezmax

Here's how you can do it with JAXP (bundled with Java): see the JAXP Manual.
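For reference, a minimal sketch using the standard javax.xml.xpath API. The filename and XPath expression are the ones from the question, and the class name is just for illustration; note that the built-in DocumentBuilder only accepts well-formed XML, so this assumes the page is valid XHTML (for real-world tag soup you would first convert it to a W3C DOM with an HTML cleaner such as TagSoup):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class JaxpXPathExample {
    public static void main(String[] args) throws Exception {
        // Parse the file into a W3C DOM tree (requires well-formed XHTML)
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("test_001_html64.html"));

        // Compile and evaluate the XPath expression from the question
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "/html/body/div/div[2]/div[2]/div/div[2]/div[3]/p",
                doc, XPathConstants.NODESET);

        for (int i = 0; i < nodes.getLength(); i++) {
            System.out.println(nodes.item(i).getTextContent());
        }
    }
}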

bezmax

You can easily do this in jsoup.

Document doc = Jsoup.parse(new File("test_001_html64.html"), "UTF-8"); // Jsoup.connect() expects a URL; use parse() for a local file
Elements info = doc.getElementsByTag("html");
// iterate recursively to the desired location in the DOM tree

For faster parsing, you can start from specific tags or IDs.
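jsoup's core query API uses CSS selectors rather than XPath, so as a rough sketch the question's XPath can be translated into a selector, using :nth-of-type to mirror the positional div[2] indexing. The filename is the one from the question and the class name is just for illustration:

import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupSelectorExample {
    public static void main(String[] args) throws Exception {
        // Parse the local file (Jsoup.connect() is only for URLs)
        Document doc = Jsoup.parse(new File("test_001_html64.html"), "UTF-8");

        // Rough CSS-selector equivalent of the XPath in the question
        Elements paragraphs = doc.select(
                "body > div > div:nth-of-type(2) > div:nth-of-type(2) > div"
                + " > div:nth-of-type(2) > div:nth-of-type(3) > p");

        for (Element p : paragraphs) {
            System.out.println(p.text());
        }
    }
}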

The jsoup API documentation is available at jsoup.org/apidocs.

Sumit Bisht