How to make an xpath expression read through a part of the document only (Ruby/Nokogiri/xpath)

Question

I use Ruby 1.9.3p385, Nokogiri and XPath 1.0.

With help from awesome people on Stack Overflow I have come up with this XPath expression:

products = xml_file.xpath("/root_tag/middle_tag/item_tag")

to split this XML file:

<root_tag>
  <middle_tag>
    <item_tag>
      <headline_1>
        <tag_1>Product title 1</tag_1>
      </headline_1>
      <headline_2>
        <tag_2>Product attribute 1</tag_2>
      </headline_2>
    </item_tag>
    <item_tag>
      <headline_1>
        <tag_1>Product title 2</tag_1>
      </headline_1>
      <headline_2>
        <tag_2>Product attribute 2</tag_2>
      </headline_2>
    </item_tag>
  </middle_tag>
</root_tag>

into 2 products.

I now wish to go through each product and extract all the product information (by extracting its leaf nodes). For that purpose I am using this code:

products.each do |product|
  puts product #=> <item_tag><headline_1><tag_1>Product title 1</tag_1></headline_1><headline_2><tag_2>Product attribute 1</tag_2></headline_2></item_tag>
  product_data = product.xpath("//*[not(*)]")
  puts product_data #=> <tag_1>Product title 1</tag_1><tag_2>Product attribute 1</tag_2><tag_1>Product title 2</tag_1><tag_2>Product attribute 2</tag_2>
end

As you can see, this does exactly what I want, except for one thing: it searches through all of products instead of just the current product.

How do I limit my search to product only? When answering, please note that the example is simplified. I would prefer that the solution "erase" the knowledge of products (if possible), because then it will probably work in all cases.

The `//` selector in `//*[not(*)]` changes the scope of your XPath back to the document root (above `root_tag`). You'll need to write this using a local selector, like `headline_1` or `headline_1/tag_1`, and not one that starts with `//`. — Aaron Breckenridge, Mar 31 '13 at 15:02
OK, but can you suggest an expression that could handle this? In my code I have this: **paths = ["/root_tag/middle_tag/item_tag/headline_1", "/root_tag/middle_tag/item_tag/headline_2"]**. Maybe we could extract **"headline_1"** and **"headline_2"** (the parts that do not occur in both) and then search for them locally... — JohnSmith1976, Mar 31 '13 at 15:44
I am a professional scraper, so if you put up some $$, I can do it for you, as you [requested](http://stackoverflow.com/questions/21752838/how-to-scrape-a-website-with-the-socksify-gem-proxy). If you are interested, drop me an email at the address mentioned in my profile. — Arup Rakshit, Feb 13 '14 at 12:15
Thanks, but I am simply looking for a regular SO code answer, so I can put the code into my app and do lots of stuff with it. — JohnSmith1976, Feb 13 '14 at 12:36

score 2 · Answer 1 · answered Mar 31 '13 at 16:13

Instead of:

//*[not(*)]

Use:

(//product)[1]//*[not(*)]

This selects only the "leaf nodes" under the first product element in the XML document (item_tag in the question's sample).

Repeat this for every product element in the document, changing the index each time. You can get their count with:

count(//product)

Dimitre Novatchev

Perfect, just what I was looking for. – JohnSmith1976 Mar 31 '13 at 16:23

score 0 · Answer 2 · answered Mar 31 '13 at 16:02

You may just want:

product_data = product.xpath("*")

which will find all child elements of product (its direct element children only, not deeper descendants).

Neil Slater

score 0 · Accepted Answer · answered Apr 05 '13 at 23:03

The answer is to simply add a . before //*[not(*)]:

product_data = product.xpath(".//*[not(*)]")

The leading . makes the expression relative to the current node (product), so the search covers only that node's descendants rather than starting from the document root.

Mr. Novatchev's answer, while technically correct, means querying by position from the document root for each product, which does not fit naturally into an idiomatic Ruby each loop.
