2

I use Ruby 1.9.3p385, Nokogiri, and XPath 1.0.

With help from awesome people on Stack Overflow I have come up with this XPath expression:

products = xml_file.xpath("/root_tag/middle_tag/item_tag")

to split this XML file:

<root_tag>
  <middle_tag>
    <item_tag>
      <headline_1>
        <tag_1>Product title 1</tag_1>
      </headline_1>
      <headline_2>
        <tag_2>Product attribute 1</tag_2>
      </headline_2>
    </item_tag>
    <item_tag>
      <headline_1>
        <tag_1>Product title 2</tag_1>
      </headline_1>
      <headline_2>
        <tag_2>Product attribute 2</tag_2>
      </headline_2>
    </item_tag>
  </middle_tag>
</root_tag>

into 2 products.

I now wish to go through each product and extract all the product information (by extracting its leaf nodes). For that purpose I am using this code:

products.each do |product|
  puts product #=> <item_tag><headline_1><tag_1>Product title 1</tag_1></headline_1><headline_2><tag_2>Product attribute 1</tag_2></headline_2></item_tag>
  product_data = product.xpath("//*[not(*)]")
  puts product_data #=> <tag_1>Product title 1</tag_1><tag_2>Product attribute 1</tag_2><tag_1>Product title 2</tag_1><tag_2>Product attribute 2</tag_2>
end

As you can see, this does exactly what I want, except for one thing: it searches all of products instead of just the current product.

How do I limit my search to product only? When answering, please note that the example is simplified. I would prefer that the solution "erase" the knowledge of products (if possible), because then it will probably work in all cases.

JohnSmith1976
  • 1
    The `//` selector in `//*[not(*)]` changes the scope of your XPath back to the document root element (the `root_tag`). You'll need to write this using a local selector, like `headline_1` or `headline_1/tag_1`, and not one with `//`. – Aaron Breckenridge Mar 31 '13 at 15:02
  • OK, but do you have a suggestion for an expression that could handle this? In the code I do have this: **paths = ["/root_tag/middle_tag/item_tag/headline_1", "/root_tag/middle_tag/item_tag/headline_2"]**. Maybe we could extract **"headline_1"** and **"headline_2"** (the parts that do not occur in both) and then search for them locally... – JohnSmith1976 Mar 31 '13 at 15:44
  • I am a professional scraper, so if you put up some $$, I can do it for you, as you [requested](http://stackoverflow.com/questions/21752838/how-to-scrape-a-website-with-the-socksify-gem-proxy). If you are interested, drop me an email at the address mentioned in my profile. – Arup Rakshit Feb 13 '14 at 12:15
  • Thanks, but I am simply looking for a regular SO code answer, so I can put the code into my app and do lots of stuff with it. – JohnSmith1976 Feb 13 '14 at 12:36

3 Answers

2

Instead of:

//*[not(*)] 

Use:

(//product)[1]//*[not(*)] 

This selects the "leaf nodes" only under the first product element in the XML document.

Repeat this for all product elements in the document. You can get their count by:

count(//product)

Dimitre Novatchev
0

You may just want:

product_data = product.xpath("*")

which will find all direct child elements of product.

Neil Slater
0

The answer is to simply add a . before //*[not(*)]:

product_data = product.xpath(".//*[not(*)]")

This tells the XPath expression to start at the current node rather than the root.

Mr. Novatchev's answer, while technically correct, would not result in the parsing code being idiomatic Ruby.

Mark Thomas