I'm writing XPath expressions to select all the links under each category in the left sidebar of the following page: http://www.indexmundi.com/commodities/

I want to select the links under each category, one category at a time. I've written the following XPath, and it somehow selects the links under the first category (Commodity Price Indices), but I don't know how to select the links under the other categories. I want to add a check on the h3: if its text is Energy, count and select all the rows before it; then, if the h3 text is Beverages, count and select all the rows between Energy and Beverages.

.//*[@id='dlCommodities']/tbody/tr[position()< count(following-sibling::tr/td/h3)-1]/td/a

Here is another xpath: .//*[@id='dlCommodities']/tbody/tr[preceding-sibling::tr/td/h3[. = 'Energy'] and following-sibling::tr/td/h3[. = 'Beverages']]/td/a

It fulfills the second requirement, i.e. selecting the rows between two specific headings, but it misses one node.

Please help me fix these xpaths or suggest a better one.

Thanks

Sibtain Norain
  • The page does not actually contain `tbody` elements, but they get added if the HTML is parsed to DOM. Consider http://stackoverflow.com/questions/18241029/why-does-my-xpath-query-scraping-html-tables-only-work-in-firebug-but-not-the if you've got problems with XPath expressions containing `tbody` axis steps. – Jens Erat Feb 07 '14 at 13:14

1 Answer

I understand your actual problem as: Find all links that belong to a given category. For doing so, find the category, and then retrieve all elements before the next category.

You can remove the newlines if you prefer; I added them for readability.

//tr[td/h3="Energy"]/(self::tr, following-sibling::tr[
  . << //tr[td/h3="Energy"]/following-sibling::tr[td/h3][1]
])

If you do not have an XPath 2.0 compatible processor, you cannot use the << operator, which tests node order (the current node must precede the next category heading). An XPath 1.0 solution is even slightly shorter, but in my opinion less readable:

//tr[td/h3="Energy"] | //tr[td/h3="Energy"]/following-sibling::tr[
  ./preceding-sibling::tr[td/h3][1][td/h3="Energy"] and not(td/h3)
]

Both queries select all rows of a category; to count them, wrap the whole expression in count(...).
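
If you need every category rather than just one, the same grouping can also be done procedurally instead of in a single XPath expression. The following is only a minimal sketch, not the XPath approach above: it uses Python's standard `xml.etree.ElementTree` on a small, made-up table that I assume resembles the sidebar (each heading row also holds the first link of its category). The `SAMPLE` markup, the `href` values, and the function name `links_by_category` are hypothetical. The real page is HTML, so it would need an HTML parser rather than an XML one.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the sidebar table (not the real page markup).
SAMPLE = """
<table id="dlCommodities">
  <tr><td><h3>Commodity Price Indices</h3><a href="/idx/all">All Commodities</a></td></tr>
  <tr><td><a href="/idx/food">Food</a></td></tr>
  <tr><td><h3>Energy</h3><a href="/energy/coal">Coal</a></td></tr>
  <tr><td><a href="/energy/crude">Crude Oil</a></td></tr>
  <tr><td><a href="/energy/gas">Natural Gas</a></td></tr>
  <tr><td><h3>Beverages</h3><a href="/bev/cocoa">Cocoa</a></td></tr>
  <tr><td><a href="/bev/coffee">Coffee</a></td></tr>
</table>
"""


def links_by_category(table):
    """Map each h3 heading text to the hrefs of the links that follow it."""
    groups = {}
    current = None
    for row in table.iter("tr"):
        heading = row.find("td/h3")
        if heading is not None:
            # A row with an h3 starts a new category.
            current = (heading.text or "").strip()
            groups[current] = []
        if current is None:
            # Rows before the first heading belong to no category.
            continue
        # The heading row may itself contain the first link of the category.
        for link in row.findall("td/a"):
            groups[current].append(link.get("href"))
    return groups


table = ET.fromstring(SAMPLE.strip())
groups = links_by_category(table)
for name, hrefs in groups.items():
    print(name, len(hrefs), hrefs)
```

A single pass over the rows avoids re-evaluating the "nearest preceding heading" test for every row, and the per-category count is just the length of each list.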

Jens Erat
  • Unfortunately the first xpath is not working for me. The second xpath is selecting the required nodes correctly except just one node. It misses the row which contains the category name and selects the very first row of next category. – Sibtain Norain Feb 07 '14 at 17:50
  • Never say "not working for me"; always explain why. Do you receive an error message? Does the query yield wrong output? If you do not explain in detail what is going wrong, nobody is able to help you. Regarding the missing first item: my bad, I didn't realize that the first link is contained in the same row as the heading. That is also why the first item of the next category was included; this is fixed as well. See the edit. – Jens Erat Feb 07 '14 at 18:43
  • I tried it in firebug in Firefox and I got "Invalid xpath" error. Then I tried to use it in Scrapy xpath selector [link](http://doc.scrapy.org/en/latest/topics/selectors.html#scrapy.selector.Selector.xpath) and I got the same message in that case as well. – Sibtain Norain Feb 07 '14 at 19:17
  • This is because it's an XPath 2.0 expression. Neither Firefox nor Scrapy supports XPath 2.0. Why did you use [tag:xpath-2.0]? – Jens Erat Feb 07 '14 at 19:45