I am trying to scrape this page using Nokogiri to get all the elements with the class name "teaser".

If I check the page with jQuery, I can see there are 25 elements:

$(".teaser").length => 25

However, when using Nokogiri, I only get the first teaser:

teasers = doc.css('.teaser')
teasers.count => 1

Where am I going wrong? How do I get all the teasers?

the Tin Man
Jackson Cunningham
  • If you look at the output of "doc.to_html", you will see only one teaser element. – dnsh Sep 12 '16 at 18:22
  • You should take a look at http://stackoverflow.com/questions/13789583/html-is-read-before-fully-loaded-using-open-uri-and-nokogiri – dnsh Sep 12 '16 at 18:35

1 Answer

That document appears to contain a lot of null bytes for some reason, and this causes Nokogiri/LibXML to assume the document has ended partway through, so only the content before the first null gets parsed.

You should be able to fix it by preprocessing the contents to remove the nulls. If page is a string containing the text of the webpage:

page.gsub!(/\x00/, '')

Then use Nokogiri on page as before.
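For reference, the whole fix could be wrapped up roughly like this. It is only a sketch: the helper names strip_null_bytes and teasers_from are made up for illustration, and it assumes the raw HTML has already been fetched into a string. It uses String#delete rather than gsub! so that a string is always returned, even when there was nothing to remove.

```ruby
# Hypothetical helper: return a copy of +html+ with every NUL byte removed.
# String#delete always returns a new string, whereas gsub! returns nil
# when no substitution was made.
def strip_null_bytes(html)
  html.delete("\0")
end

# Hypothetical helper: strip the NULs, parse the result with Nokogiri,
# and return the nodes matching +selector+.
def teasers_from(html, selector = '.teaser')
  require 'nokogiri' # loaded lazily so strip_null_bytes works without the gem
  Nokogiri::HTML(strip_null_bytes(html)).css(selector)
end
```

With page holding the downloaded HTML, teasers_from(page).count should then report every teaser on the page instead of stopping at the first null byte.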

the Tin Man
matt