[Image: screenshot of the table]

I'm trying to extract all five rows of the table shown above.

I'm using the Ruby Hpricot library to extract the table rows with an XPath expression.

In my example, the XPath expression is /html/body/center/table/tr. Note that I've removed the tbody step from the expression, which is usually necessary for the extraction to succeed.

The weird thing is that the result contains only the first three rows; the last two are missing. I have no idea what's going on.

EDIT: Nothing magic about the code, just attaching it upon request.

require 'open-uri'
require 'hpricot'

faculty = Hpricot(open("http://www.utm.utoronto.ca/7800.0.html"))
(faculty/"/html/body/center/table/tr").each do |text|
  puts text.to_s
end
Terry Li

2 Answers

The HTML document in question is invalid. (See http://validator.w3.org/check?uri=http%3A%2F%2Fwww.utm.utoronto.ca%2F7800.0.html.) Hpricot parses it differently from your browser, hence the different results, but it can't really be blamed: until HTML5, there was no standard for how to parse invalid HTML documents.

I tried replacing Hpricot with Nokogiri and it seems to give the expected parse. Code:

require 'open-uri'
require 'nokogiri'

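# URI.open comes from open-uri; on Ruby 3.0+ plain Kernel#open no longer fetches URLs.
faculty = Nokogiri::HTML(URI.open("http://www.utm.utoronto.ca/7800.0.html"))

# Print every <tr> that is a direct child of the table.
faculty.search("/html/body/center/table/tr").each do |row|
  puts row
end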

Maybe you should switch?

qerub

The path table/tr does not exist. It should be table/tbody/tr or table//tr. With table/tr, you're specifically asking for a <tr> that is a direct child of <table>, but judging from your image, that isn't how the markup is structured.
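
To make the difference between the two expressions concrete, here is a minimal sketch using REXML, which ships with Ruby. The markup, the MARKUP constant and the row_ids helper are all made up for illustration (this is not the page from the question), and REXML is an XML parser, so it only stands in for an HTML parser such as Hpricot or Nokogiri on well-formed input.

```ruby
require 'rexml/document'

# Hypothetical markup: two rows are direct children of <table>,
# two more are wrapped in a <tbody>.
MARKUP = <<~XML
  <html><body>
    <table>
      <tr id="r1"><td>a</td></tr>
      <tr id="r2"><td>b</td></tr>
      <tbody>
        <tr id="r3"><td>c</td></tr>
        <tr id="r4"><td>d</td></tr>
      </tbody>
    </table>
  </body></html>
XML

doc = REXML::Document.new(MARKUP)

# Hypothetical helper: return the id attribute of every node the XPath selects.
def row_ids(doc, xpath)
  REXML::XPath.match(doc, xpath).map { |tr| tr.attributes['id'] }
end

# "/" selects only direct children of <table>.
child_rows = row_ids(doc, '/html/body/table/tr')

# "//" selects <tr> elements at any depth below <table>.
descendant_rows = row_ids(doc, '/html/body/table//tr')

puts "table/tr:  #{child_rows.inspect}"
puts "table//tr: #{descendant_rows.inspect}"
```

With this input, table/tr should pick up only the rows that sit directly under <table>, while table//tr also reaches the rows inside <tbody>. If a parser places some rows in a different position in its tree than you expect, a child-only path can return a partial result in the same way.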

d11wtq
  • tbody isn't present in this example; the Firefox extension Firebug adds the extra tag for us. table/tr works here, as mentioned in my original question, but only partially. I'm able to extract the first three rows but not the last two, which is really weird. – Terry Li Nov 20 '11 at 23:05
  • I didn't realize Firebug added additional tags. That explains why I was having such a hard time today using Nokogiri and Firebug together to locate the TR rows I cared about. (I had a table embedded within a table, all without ids.) Now I wonder if the HTML wasn't valid in the first place. – beach Jun 27 '12 at 00:52