extract single string from HTML using Ruby/Mechanize (and Nokogiri)

Question

I am extracting data from a forum. My script based on is working fine. Now I need to extract the date and time (21 Dec 2009, 20:39) from a single post. I cannot get it to work. I used FireXPath to determine the XPath.

Sample code:

    require 'rubygems'
    require 'mechanize'

    post_agent = WWW::Mechanize.new
    post_page = post_agent.get('http://www.vbulletin.org/forum/showthread.php?t=230708')
    puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
    puts post_page.parser.at_xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
    puts post_page.parser.xpath('//[@id="post1960370"]/tbody/tr[1]/td/div[2]/text()')

All my attempts end with an empty string or an error.

I cannot find any documentation on using Nokogiri within Mechanize. The Mechanize documentation says at the bottom of the page:

After you have used Mechanize to navigate to the page that you need to scrape, then scrape it using Nokogiri methods.

But what methods? Where can I read about them with samples and explained syntax? I did not find anything on Nokogiri's site either.

score 28 · Accepted Answer · edited Sep 07 '12 at 19:59

Radek. I'm going to show you how to fish.

When you call Mechanize::Page::parser, it's giving you the Nokogiri document. So your "xpath" and "at_xpath" calls are invoking Nokogiri. The problem is in your xpaths. In general, start out with the most general xpath you can get to work, and then narrow it down. So, for example, instead of this:

puts  post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip

start with this:

puts post_page.parser.xpath('//table').to_html

This gets all tables, anywhere in the document, and prints them as HTML. Examine the HTML to see which tables it brought back. It probably grabbed several when you want only one, so you'll need to tell it how to pick out the one table you want. If, for example, you notice that the table you want has the CSS class "userdata", then try this:

puts post_page.parser.xpath("//table[@class='userdata']").to_html

Any time you get back nothing (an empty result), you goofed up the xpath, so fix it before proceeding. Once you're getting the table you want, then try to get the rows:

puts post_page.parser.xpath("//table[@class='userdata']//tr").to_html

If that worked, then take off the "to_html" and you now have an array of Nokogiri nodes, each one a table row.

And that's how you do it.

edited Sep 07 '12 at 19:59

the Tin Man

answered Jan 22 '10 at 03:29

Wayne Conrad

PS: This is a general tutorial showing how you figure out the correct xpath: You don't start with a fully specified xpath, because then you've got no idea what's wrong if it returns nothing. Start with something so general that it's guaranteed to return something, and then keep making it more specific until you have the one thing you want. By doing it in steps, when it doesn't work you know it's the last thing you added to the xpath. – Wayne Conrad Jan 22 '10 at 03:57
@Wayne Conrad: Hi Wayne, thank you for the nice tutorial. I will try what you say, but I thought that, as I want only the first instance of the element, it would be easy and fast to use an absolute xpath. And it would give me the first item from the array. – Radek Jan 22 '10 at 03:58
So you would follow all these steps even if you just wanted to get the number of times this question was viewed? – Radek Jan 22 '10 at 04:03
Yes, I always figure out my xpaths iteratively. Someone who is good at xpath might be able to get it right the first time. That someone is not me. It's not the xpath that decides whether you get one thing or many. It's whether you call "xpath" or "at_xpath". If you call "at_xpath", you'll always get one thing; if multiple elements matched, you'll only get the first one. If you call "xpath", you'll always get an array, even if you matched just one thing. – Wayne Conrad Jan 22 '10 at 04:17
Wow, this is something I was looking for: the difference between 'xpath' and 'at_xpath'. Great! Thank you for that. How did you learn that? – Radek Jan 22 '10 at 04:18
I cannot get why the full xpath doesn't work!? Full xpath + 'at_xpath' will give me the first match and I would be happy :-) – Radek Jan 22 '10 at 04:19
Did you try what I said? Start with '//table', then get it to pick out just the one table that has the data you want. – Wayne Conrad Jan 22 '10 at 04:31
I am almost there. I have an array of 15 tables (= 15 posts) where the first table has the data that I want. The xpath is "//div[@id='posts']/div/table". If I add tbody to be more specific, it gives me nil. – Radek Jan 22 '10 at 04:42
A one-line solution is puts post_page.parser.xpath("//div[@id='posts']/div/table/tr/td/div[2]")[0].xpath('text()').to_s.strip – Radek Jan 22 '10 at 05:14
What happens if you use two slashes before tbody instead of one? What does that tell you? – Wayne Conrad Jan 22 '10 at 05:14
"//div[@id='posts']/div/table//tbody/tr/td" gives nil too. – Radek Jan 22 '10 at 05:18
When I used .at_xpath anywhere in this exercise, I got no results. – Radek Jan 22 '10 at 05:24
Even when you use at_xpath('//table')? – Wayne Conrad Jan 22 '10 at 05:37
Yes, at_xpath('//table') gives me something. Even puts post_page.parser.at_xpath("//div[@id='posts']/div/table/tr/td/div[2]") gives me what I want. But to extract the final piece I have to use xpath; at_xpath gives me an empty string. – Radek Jan 22 '10 at 06:07
Go back to the other answers I've given you. You'll see something different at the end of the final xpaths. – Wayne Conrad Jan 22 '10 at 07:56
I apologize for my inability to communicate clearly. I honestly don't know where to go now--this is now an individual tutoring session, which I'm not sure SO is for. That's not what bothers me, though. I'm bothered that I haven't figured out how to communicate the key concepts I want to get across. There are general problem solving principles in programming that I want to communicate that will help you solve not just this problem but any problem. Sadly, I am not up to the task. – Wayne Conrad Jan 22 '10 at 17:14
@Wayne Conrad: you did a good job. I can now do the fishing by myself :-) I will post separate questions for clarification. Let's close it here. Thank you so much. – Radek Jan 22 '10 at 18:10

score 6 · Answer 2 · answered Dec 29 '10 at 20:39

I think you have copied this from Firebug. Firebug gives you an extra tbody, which might not be there in the actual code, so my suggestion is to remove that tbody and try again. If it still doesn't work, then follow Wayne Conrad's process. That's the best!

answered Dec 29 '10 at 20:39

RubyDubee

The source inside a browser is always suspect because the browser can, and will, do a lot of fixup of bad HTML or just massage it into the format it wants it to be in. The `<tbody>` tag is a good example. I use the browser's source view as an "it's kind of like this" view, but retrieve the actual HTML directly from the host and look at it in an editor when I'm trying to parse and things seem to be nonsense. Using IRB with an open and poking at the parsed doc is often good enough, but there are times it takes having the editor open. – the Tin Man Dec 29 '10 at 23:03
