
I installed Ruby and Mechanize. It seems to me that it is possible to do what I want with Nokogiri, but I do not know how to do it.

What about this table? It is just part of the HTML of a vBulletin forum site. I tried to keep the HTML structure but deleted some text and tag attributes. I want to get some details per thread, such as Title, Author, Date, Time, Replies, and Views.

Please note that there are a few tables in the HTML document. I am after one particular table, the one containing <tbody id="threadbits_forum_251">. The id will always be the same (I hope). Can I use the tbody and its id in the code?

<table >
  <tbody>
    <tr>  <!-- table header --> </tr>
  </tbody>
  <!-- show threads -->
  <tbody id="threadbits_forum_251">
    <tr>
      <td></td>
      <td></td>
      <td>
        <div>
          <a href="showthread.php?t=230708" >Vb4 Gold Released</a>
        </div>
        <div>
          <span><a>Paul M</a></span>
        </div>
      </td>
      <td>
          06 Jan 2010 <span class="time">23:35</span><br />
          by <a href="member.php?find=lastposter&amp;t=230708">shane943</a>
      </td>
      <td><a href="#">24</a></td>
      <td>1,320</td>
    </tr>

  </tbody>
</table>
C R
Radek
  • Actually, the attributes can make finding the data easier, especially with xpath. – Wayne Conrad Jan 14 '10 at 04:29
  • @Wayne could you tell me why attributes can make it easier? – Radek Jan 17 '10 at 18:57
  • Often you will find that the data you want has specific attributes that happen to make it easier for you to build an xpath to pick out that data. For example, if the table you want is `<table class="message">`, and there are other tables you don't want but none of them have that CSS class, then the xpath for picking out the table you want is simply: "//table[@class='message']" – Wayne Conrad Jan 17 '10 at 22:57
  • _NOTE:_ Be very careful trying to use `<tbody>` tags as way-points or in selectors. While the spec says HTML should have them, they're not required and a lot of HTML in the wild doesn't have them in the table definition. The problem is that browsers often add them when rendering the page and display them when you look at the page's source, so don't trust the browser's HTML source view. Instead _ALWAYS_ use `wget` or `curl` or `nokogiri` at the command-line to view the actual page source and verify the actual markup. – the Tin Man Feb 14 '20 at 02:27

1 Answer

#!/usr/bin/ruby1.8

require 'nokogiri'
require 'pp'

html = <<-EOS
  (The HTML from the question goes here)
EOS

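# Parse the HTML into a searchable document.
doc = Nokogiri::HTML(html)
# Select every row of the tbody that holds the thread list.
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')
# Build one hash per row; each xpath below is relative to that row.
details = rows.collect do |row|
  detail = {}
  [
    [:title, 'td[3]/div[1]/a/text()'],
    [:name, 'td[3]/div[2]/span/a/text()'],
    [:date, 'td[4]/text()'],
    [:time, 'td[4]/span/text()'],
    [:number, 'td[5]/a/text()'],
    [:views, 'td[6]/text()'],
  ].each do |name, xpath|
    # at_xpath returns the first match, or nil (which becomes "").
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp details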

# => [{:time=>"23:35",
# =>   :title=>"Vb4 Gold Released",
# =>   :number=>"24",
# =>   :date=>"06 Jan 2010",
# =>   :views=>"1,320",
# =>   :name=>"Paul M"}]
Wayne Conrad
  • I think the css equivalent would be `doc.css('tbody#threadbits_forum_251 tr')`, but I haven't actually tested that in code... – kejadlen Jan 14 '10 at 05:41
  • @Kejadlen, I replaced the doc.xpath(...) call with your doc.css call, and it worked great. – Wayne Conrad Jan 14 '10 at 07:18
  • Could somebody explain the syntax to me? Thank you in advance. – Radek Jan 14 '10 at 10:15
  • What's got you stumped? Is it the Ruby syntax, the xpath syntax, or both? – Wayne Conrad Jan 14 '10 at 15:59
  • Hi Wayne, I am a Ruby baby. First of all, I installed mechanize, and it was said that it uses nokogiri to parse, so I can use nokogiri's HTML methods. I cannot make it work with a setup like that. Do I have to install nokogiri separately? It seems to me that I already have it installed. doc = Nokogiri::XML(f) gives me an error: ./nokogiri.rb:7: uninitialized constant Nokogiri (NameError). And to be honest, I did not understand the xpath either. //table/tbody[@id="threadbits_forum_251"]/tr is like magic from a different world to me. I'd say it means search for table and tbody where id=xxx, but why /tr? – Radek Jan 15 '10 at 06:07
  • And why does it start with //? I cannot find any good (good enough for ME) documentation on that... – Radek Jan 15 '10 at 06:09
  • Yes, you already have nokogiri. See http://stackoverflow.com/questions/2060247/how-to-read-someone-elses-forum/2060983#2060983 for an example using mechanize. That example doesn't directly use nokogiri, except on the commented-out line to print the fetched html. But nokogiri is there inside mechanize if you need it (just call page.parser). The xpath you quoted means "get me any table, anywhere, that has a tbody child with the attribute id equal to threadbits_forum_251." – Wayne Conrad Jan 15 '10 at 07:44
  • @Wayne, thank you sooooo much. I updated the code following your other example and it is now working very nicely. I still have a few questions. The most important is whether you could suggest any documentation for me. The next one is why there is /tr at the end of the xpath you nicely explained to me. I want to extract the URL of the post too. I tried [:url, 'td[3]/div[1]/a'], [:url, 'td[3]/div[1]/a href/text()'], [:url, 'td[3]/div[1]/a/href/text()'], [:url, 'td[3]/div[1]/a/href'], and nothing worked. Where can I learn how to extract href, id, alt, src, etc.? Thank you – Radek Jan 16 '10 at 06:46
  • @Wayne, another question: I want to add some info from the post itself, so I have to click it and add the info to the detail object. Where in your code can I add such code? I hope I am not asking too much. Could you explain the code after details? Thank you – Radek Jan 16 '10 at 06:58
  • The forum I use to learn mechanize/nokogiri/parsing is http://www.vbulletin.org/forum/forumdisplay.php?f=251 – Radek Jan 16 '10 at 07:13
  • Radek, These are all great questions. What would you say to creating more SO questions? That way you'll get more people's answers. – Wayne Conrad Jan 16 '10 at 23:24
  • @Wayne Conrad: Wayne, can I ask why you use an array of hashes to store the data? Why not a hash of hashes or an object? Thank you – Radek Jan 22 '10 at 19:48
  • Mostly, because an array of hashes was the simplest thing that could possibly work, making for a clearer example. Also, and I don't know if this matters for you, in Ruby < 1.9, hashes don't have a well-defined order so you lose the original order of the rows. – Wayne Conrad Jan 22 '10 at 20:16