I'm scraping a website for certain pieces of information. The piece of HTML that I am looking for is the following:
1. <div class="data">
2. <a class="anchor" name="123"></a>
3. <a class="image_link" id="image_id" href="http:/link1">
4. <img class="mainimg" id="456" src="http://link2" alt="description" title="title" >
5. </a>
6. </div>
The page of course contains many of these <div class="data"> blocks,
and I want to scrape each of them for the following information:
- name=123 (from line 2)
- href=link1 (from line 3)
- src=http://link2, alt=description (from line 4)
I'm able to do this, but only by using 3 different XPath expressions, like so:
Object[] o1 = node.evaluateXPath("//div[@class='data']/a/img");
Object[] o2 = node.evaluateXPath("//div[@class='data']/a[@class='image_link']");
Object[] o3 = node.evaluateXPath("//div[@class='data']/a[@class='anchor']");
and then reading each attribute from the matching nodes, for example:
((TagNode)o1[i]).getAttributeByName("src");
This works, but I'm traversing the same HTML 3 times and ending up with 3 separate data structures, so the values that belong to one <div class="data"> are only linked by their array index.
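To make the goal concrete, here is a minimal sketch of the shape I'm after: select each <div class="data"> once, then read the attributes relative to that div so everything lands in one row per div. Note the assumptions: it uses the JDK's javax.xml.xpath instead of HtmlCleaner, it runs on a well-formed (self-closed img) copy of the snippet, and the class name DataDivScraper, the method scrape, and the second sample div are all made up for illustration. It still issues several small relative queries per div, so it is not a single expression in the strict sense.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DataDivScraper {

    // Hypothetical input: the snippet from the question made well-formed,
    // plus a second, invented div so there is more than one row.
    public static final String SAMPLE =
        "<html><body>"
        + "<div class=\"data\">"
        + "<a class=\"anchor\" name=\"123\"></a>"
        + "<a class=\"image_link\" id=\"image_id\" href=\"http:/link1\">"
        + "<img class=\"mainimg\" id=\"456\" src=\"http://link2\" alt=\"description\" title=\"title\"/>"
        + "</a></div>"
        + "<div class=\"data\">"
        + "<a class=\"anchor\" name=\"124\"></a>"
        + "<a class=\"image_link\" id=\"image_id2\" href=\"http:/link3\">"
        + "<img class=\"mainimg\" id=\"457\" src=\"http://link4\" alt=\"other description\" title=\"title\"/>"
        + "</a></div>"
        + "</body></html>";

    // Returns one map per <div class="data">, holding name, href, src and alt.
    public static List<Map<String, String>> scrape(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // The only document-wide query: find every data div once.
            NodeList divs = (NodeList) xpath.evaluate(
                "//div[@class='data']", doc, XPathConstants.NODESET);

            List<Map<String, String>> rows = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                Node div = divs.item(i);
                Map<String, String> row = new LinkedHashMap<>();
                // Relative paths are evaluated against the current div only.
                row.put("name", xpath.evaluate("a[@class='anchor']/@name", div));
                row.put("href", xpath.evaluate("a[@class='image_link']/@href", div));
                row.put("src", xpath.evaluate("a[@class='image_link']/img/@src", div));
                row.put("alt", xpath.evaluate("a[@class='image_link']/img/@alt", div));
                rows.add(row);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or query the input", e);
        }
    }

    public static void main(String[] args) {
        for (Map<String, String> row : scrape(SAMPLE)) {
            System.out.println(row);
        }
    }
}
```

What I'd like is the HtmlCleaner equivalent of this: one collection where each entry already groups the name, href, src and alt of a single div.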
How can I get all of this with a single XPath expression, or at least in a single pass? Thanks.