
I want to extract content from an HTML document using XPath in Java. In Ruby I can do this with Nokogiri, as shown here:

require 'nokogiri'

xpath = '/html/body/div/div[2]/div[2]/div/div[2]/div[3]/p'
doc = Nokogiri::HTML(open('test_001_html64.html'))
doc.xpath(xpath).each do |link|
  puts link.content
end

I want to do it in pure Java. I looked at jsoup, but I couldn't find any documentation or example that uses an XPath expression to do this. Can someone suggest a way?

Thanks

Mir
  • Many related / duplicate questions – see http://stackoverflow.com/questions/9022140/using-xpath-contains-against-html-in-java, http://stackoverflow.com/questions/3352594/querying-an-html-page-with-xpath-in-java, and http://stackoverflow.com/questions/3361263/library-to-query-html-with-xpath-in-java – Mark Butler Jan 07 '13 at 00:43

3 Answers


You can use HtmlUnit for that task.

Here's a simple example:

import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

final WebClient webClient = new WebClient();
final HtmlPage page = webClient.getPage("http://www.google.com/");
// select nodes with the XPath expression from the question
List<DomNode> nodes = page.getByXPath("/html/body/div/div[2]/div[2]/div/div[2]/div[3]/p");
for (DomNode node : nodes) {
    System.out.println(node.getNodeName());
}
bezmax

Here's how you can do it with JAXP (bundled with Java): see the JAXP Manual.
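For reference, a minimal sketch using the standard javax.xml.xpath API. The filename and XPath expression are the ones from the question, and the class name is just for illustration; note that the built-in DocumentBuilder only accepts well-formed XML, so this assumes the page is valid XHTML (for real-world tag soup you would first convert it to a W3C DOM with an HTML cleaner such as TagSoup):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class JaxpXPathExample {
    public static void main(String[] args) throws Exception {
        // Parse the file into a W3C DOM tree (requires well-formed XHTML)
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("test_001_html64.html"));

        // Compile and evaluate the XPath expression from the question
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "/html/body/div/div[2]/div[2]/div/div[2]/div[3]/p",
                doc, XPathConstants.NODESET);

        for (int i = 0; i < nodes.getLength(); i++) {
            System.out.println(nodes.item(i).getTextContent());
        }
    }
}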

bezmax

You can easily do this in jsoup.

Document doc = Jsoup.parse(new File("test_001_html64.html"), "UTF-8"); // Jsoup.connect() expects a URL; use parse() for a local file
Elements info = doc.getElementsByTag("html");
// iterate recursively to the desired location in the DOM tree

For faster parsing, you can start from specific tags or IDs.
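jsoup's core query API uses CSS selectors rather than XPath, so as a rough sketch the question's XPath can be translated into a selector, using :nth-of-type to mirror the positional div[2] indexing. The filename is the one from the question and the class name is just for illustration:

import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupSelectorExample {
    public static void main(String[] args) throws Exception {
        // Parse the local file (Jsoup.connect() is only for URLs)
        Document doc = Jsoup.parse(new File("test_001_html64.html"), "UTF-8");

        // Rough CSS-selector equivalent of the XPath in the question
        Elements paragraphs = doc.select(
                "body > div > div:nth-of-type(2) > div:nth-of-type(2) > div"
                + " > div:nth-of-type(2) > div:nth-of-type(3) > p");

        for (Element p : paragraphs) {
            System.out.println(p.text());
        }
    }
}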

The jsoup API documentation is available at jsoup.org/apidocs.

Sumit Bisht