How do I use XPath in Nokogiri?

Question

I have not found any documentation or tutorial on this. Does anything like that exist?

doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')

The code above will get me any table, anywhere, that has a tbody child with the attribute id equal to "threadbits_forum_251". But why does it start with double //? Why is there /tr at the end? See "Ruby Nokogiri Parsing HTML table II" for more details.

Can anybody tell me how to extract href, id, alt, src, etc., using Nokogiri?

'td[3]/div[1]/a/text()' <--- extracts text

How can I extract other things?

score 49 · Accepted Answer · edited May 19 '19 at 20:56

It seems you need to read an XPath tutorial.

Your //table/tbody[@id="threadbits_forum_251"]/tr expression means:

// - anywhere in your XML document
table/tbody - take a tbody element that is a child of a table element
[@id="threadbits_forum_251"] - where that tbody's id attribute equals "threadbits_forum_251"
tr - and take its tr child elements

So, basically, you need to know:

attributes begin with @
conditions go inside [] brackets

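If I understood that API correctly, you can go with doc.at_xpath("td[3]/div[1]/a")["href"] (at_xpath returns the first matching node, whereas xpath returns a set of nodes), or select the attribute directly with td[3]/div[1]/a/@href if there is just one <a> element.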

edited May 19 '19 at 20:56 by lurker

answered Jan 17 '10 at 11:32 by Rubens Farias

@Rubens thank you. And you're right, I need to read the XPath tutorial. I thought it was the Nokogiri docs I needed to read... Would you know if there is any tool that would give me the full XPath if I click an object on an HTML page? – Radek Jan 17 '10 at 11:50

I don't know, but XPath isn't that hard. Consider your filesystem, and let's assume every folder is an XML element: when you select your `system32` folder, you get the path `\windows\system32`. Just replace each `\` with `/`, remember that attributes begin with `@` and conditions go inside `[]`, and you're good to go – Rubens Farias Jan 17 '10 at 12:01

I know this is an older answer but the link to the xpath tutorial is now broken. I think it should now be http://www.w3schools.com/xsl/xpath_intro.asp – Axiombadger Feb 17 '16 at 13:06

Fixed, @Axiombadger ! – Rubens Farias Feb 17 '16 at 16:13

score 7 · Answer 2 · edited Aug 20 '13 at 14:37

Your XPath is correct and you seem to have answered your own question's first part (almost):

doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')

"the code above will get me any ~~table~~ table's tr, anywhere, that has a tbody child with the attribute id equal to threadbits_forum_251"

// means the following element can appear anywhere in the document.

/tr at the end means: get the tr children of the matching element.

You don't need to extract each attribute one by one. Just get the entire node containing all four attributes in Nokogiri, and read the attributes using:

theNode['href']
theNode['src']

Where theNode is your Nokogiri Node object.

Edit:

Sorry I haven't used these libraries, but I think the XPath evaluation and parsing is being done by Mechanize. So here's how you would get the entire element and its attributes in one go.

doc.xpath("td[3]/div[1]/a").each do |anchor|
  puts anchor['href']
  puts anchor['src']
  # ...
end

edited Aug 20 '13 at 14:37 by the Tin Man

answered Jan 17 '10 at 11:36 by Anurag

@Anurag thank you for the nice explanation. I am using Mechanize, not pure Nokogiri. Can I use theNode['href'] somehow in [:title, 'td[3]/div[1]/a/text()']? I want to extract href instead of text – Radek Jan 17 '10 at 11:48

`[:address, 'td[3]/div[1]/a/@href']` ? – Rubens Farias Jan 17 '10 at 11:51
I was searching for Nokogiri tutorials and came across my own answer... hehe :) – Anurag Apr 21 '11 at 19:32
Mechanize uses Nokogiri internally, so it is using pure Nokogiri, it's just behind the curtain. Mechanize's [`Mechanize::Page.parser`](http://mechanize.rubyforge.org/Mechanize/Page.html#method-i-parser) returns the root of the parsed page as a Nokogiri document. – the Tin Man Aug 20 '13 at 14:40
