
I'm scraping a website for certain pieces of information. The piece of HTML that I am looking for is the following:

1. <div class="data">
2.   <a class="anchor" name="123"></a>
3.   <a class="image_link" id="image_id" href="http:/link1">
4.     <img class="mainimg" id="456" src="http://link2" alt="description" title="title" >
5.   </a>
6. </div>

The webpage of course contains many of these <div class="data"> blocks, and I want to scrape the following information from each of them:

  • name=123 (from line 2)
  • href=http:/link1 (from line 3)
  • src=http://link2, alt=description (from line 4)

I'm able to do this, but only by using three different XPath expressions, like so:

Object[] o1 = node.evaluateXPath("//div[@class='data']/a/img");
Object[] o2 = node.evaluateXPath("//div[@class='data']/a[@class='image_link']");
Object[] o3 = node.evaluateXPath("//div[@class='data']/a[@class='anchor']");

and then getting each attribute, like for example:

((TagNode)o1[i]).getAttributeByName("src");

This works, but it traverses the same HTML data three times and leaves me with three separate data structures holding the information I need.

How can I do this with a single XPath expression? Thanks.

Henrique

1 Answer


Take the union of two expressions:

//div[@class='data']/a/img/@*[name()='src' or name()='alt'] |
//div[@class='data']/a/@*[(parent::*/@class='image_link' and name()='href') or
                          (parent::*/@class='anchor' and name()='name')]

You could also avoid the ugliness of parent::* by splitting the second expression in two:

//div[@class='data']/a/img/@*[name()='src' or name()='alt'] |
//div[@class='data']/a[@class='image_link']/@href |
//div[@class='data']/a[@class='anchor']/@name

Either of these returns a node-set containing only attribute nodes. You'll still need to iterate those nodes. Execute the XPath in Java like this (where expression is either of the two above):

// The union evaluates to a flat node-set of attribute nodes
NodeList nodes = (NodeList) xpath.evaluate(expression, doc,
        XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
    Node attr = nodes.item(i);
    System.out.println(attr.getNodeName() + ": " + attr.getNodeValue());
}

Output:

name: 123
href: http:/link1
alt: description
src: http://link2
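As the output shows, the union returns one flat node-set, so attributes from different divs arrive interleaved. If you need them grouped back per element, a small post-processing step can bucket each attribute under its owner element via Attr.getOwnerElement(). This is a sketch of mine, not part of the original answer; the class name is illustrative, and the sample fragment is self-closed on the img tag so that it parses as well-formed XML:

```java
import java.io.ByteArrayInputStream;
import java.util.*;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;

public class GroupAttrs {
    // Well-formed version of the sample fragment (img self-closed for XML parsing)
    public static final String SAMPLE =
          "<div class='data'>"
        + "<a class='anchor' name='123'></a>"
        + "<a class='image_link' id='image_id' href='http:/link1'>"
        + "<img class='mainimg' id='456' src='http://link2' alt='description' title='title'/>"
        + "</a></div>";

    public static List<String> extract(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        String expression =
              "//div[@class='data']/a/img/@*[name()='src' or name()='alt']"
            + " | //div[@class='data']/a[@class='image_link']/@href"
            + " | //div[@class='data']/a[@class='anchor']/@name";
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);

        // Bucket each attribute under the element that carries it
        Map<Element, Map<String, String>> byElement = new LinkedHashMap<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Attr attr = (Attr) nodes.item(i);
            byElement.computeIfAbsent(attr.getOwnerElement(), e -> new LinkedHashMap<>())
                     .put(attr.getName(), attr.getValue());
        }
        // One summary row per element: tag name plus its selected attributes
        List<String> rows = new ArrayList<>();
        for (Map.Entry<Element, Map<String, String>> e : byElement.entrySet()) {
            rows.add(e.getKey().getTagName() + " " + e.getValue());
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        for (String row : extract(SAMPLE)) {
            System.out.println(row);
        }
    }
}
```

Each row pairs a tag name with that element's selected attributes, so the name, href, src, and alt values belonging to one div stay together instead of being scattered through the flat list.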

Edit: I just noticed that your sample code references a TagNode, so I suspect you might actually be using HTMLCleaner. You can try to evaluate the XPath using HTMLCleaner's built-in methods, but it's (apparently) not a compliant XPath processor, so the result is unpredictable. See this post for how to first turn the HTMLCleaner result into a W3C DOM Document and then evaluate the XPath using the standard Java APIs.

Wayne
  • The union of two or more expressions starting with `//` will most probably require two or more complete traversals of the XML tree. I believe the OP is asking exactly that: is there a way to select all the nodes in just a single pass over the XML tree? – Dimitre Novatchev Dec 20 '11 at 04:00
  • @DimitreNovatchev - Agreed. I think the OP has three concerns: 1) multiple expressions are needed; 2) multiple data structures are created and must be inspected; 3) multiple passes of the document are required. My solution addresses #1 and #2; I'm not sure #3 can be addressed in a single expression. – Wayne Dec 20 '11 at 04:03
  • First of all, thanks for answering. My main concern is optimizing my current code, since it takes about 3 seconds to parse the whole document with my current approach. I'll try taking the union of the expressions, as well as dropping HTMLCleaner, to see if there is any improvement in performance. – Henrique Dec 20 '11 at 11:37
  • Did some testing, and apparently HTMLCleaner was the bottleneck. By switching to JSoup and keeping everything else pretty much the same (I still traverse the document 3 times), the time it took to parse everything went down by 50%. – Henrique Dec 21 '11 at 14:08
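Following up on the single-pass concern raised in the comments: one alternative (my own sketch, not from the answer; the class and variable names are illustrative) is to locate the div blocks with a single absolute query, then run small, pre-compiled relative queries against each match. The document is scanned once for the blocks; each follow-up lookup only inspects one div's subtree. Using only the standard javax.xml.xpath API:

```java
import java.io.ByteArrayInputStream;
import java.util.*;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.*;
import org.w3c.dom.*;

public class SinglePass {
    public static List<Map<String, String>> extract(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // One absolute query to find the blocks...
        XPathExpression divs = xpath.compile("//div[@class='data']");
        // ...and cheap relative lookups inside each block (compiled once, reused)
        XPathExpression name = xpath.compile("string(a[@class='anchor']/@name)");
        XPathExpression href = xpath.compile("string(a[@class='image_link']/@href)");
        XPathExpression src  = xpath.compile("string(a/img/@src)");
        XPathExpression alt  = xpath.compile("string(a/img/@alt)");

        NodeList blocks = (NodeList) divs.evaluate(doc, XPathConstants.NODESET);
        List<Map<String, String>> result = new ArrayList<>();
        for (int i = 0; i < blocks.getLength(); i++) {
            Node div = blocks.item(i);
            // One record per div, so related values never get separated
            Map<String, String> record = new LinkedHashMap<>();
            record.put("name", (String) name.evaluate(div, XPathConstants.STRING));
            record.put("href", (String) href.evaluate(div, XPathConstants.STRING));
            record.put("src",  (String) src.evaluate(div, XPathConstants.STRING));
            record.put("alt",  (String) alt.evaluate(div, XPathConstants.STRING));
            result.add(record);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><div class='data'>"
                + "<a class='anchor' name='123'></a>"
                + "<a class='image_link' href='http:/link1'>"
                + "<img src='http://link2' alt='description'/></a>"
                + "</div></root>";
        for (Map<String, String> record : extract(xml)) {
            System.out.println(record);
        }
    }
}
```

This still issues four small lookups per div, but it produces one record per div instead of a flat attribute list, and it avoids repeating the full-document `//` scan for each attribute of interest.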