9

I'm using open-uri and nokogiri with Ruby to do some simple web crawling. The problem is that sometimes the HTML is read before the page has fully loaded. In such cases, I cannot fetch any content other than the loading icon and the nav bar. What is the best way to tell open-uri or nokogiri to wait until the page is fully loaded?

Currently my script looks like:

require 'nokogiri'
require 'open-uri'

url = "https://www.the-page-i-wanna-crawl.com"
# URI.open is used here because the bare Kernel#open no longer accepts URLs as of Ruby 3.0
doc = Nokogiri::HTML(URI.open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE))
puts doc.at_css("h2").text
Chelsea White
  • [example.com](http://example.com) source does not have any h2 tag. – rputikar Dec 09 '12 at 17:15
  • It's just a placeholder for my question. Sorry to be misleading. – Chelsea White Dec 09 '12 at 17:17
  • 1
    Define "fully loaded", what about images, external scripts, ajax content, etc.? – Dave Newton Dec 09 '12 at 17:21
  • I mean the main part of the page(a list of blogs) w/o external scripts and ajax content. – Chelsea White Dec 09 '12 at 17:25
  • 1
    Are you sure what you think is happening is what's happening? I.e., did you check with curl or similar? I've not seen the behavior you describe. – Dave Newton Dec 09 '12 at 17:48
  • Yes, I've checked with curl and the body part of the page only shows the loading icon. When I open the page with a browser, I can also see the loading icon, and after about 2 seconds, the content of the body part appears. – Chelsea White Dec 09 '12 at 17:56
  • 2
    That would suggest that the content is being loaded via AJAX or some other JS method and that the raw HTML source (which is all curl/nokogiri can see) doesn't contain what you want. In which case you'll need to pick another scraper that is JS/ajax aware. – Philip Hallstrom Dec 09 '12 at 18:53
  • Or, find out the AJAX URL for the content you want and request that directly. – Mark Thomas Dec 09 '12 at 23:01

1 Answer

14

What you describe is not possible. The result of open is only passed to HTML after the open method has returned the full response.

I suspect that the page itself uses AJAX to load its content, as has been suggested in the comments. In that case, you can use Watir to fetch the page with a real browser:

require 'nokogiri'
require 'watir'

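# Start a real browser so the page's JavaScript actually runs
browser = Watir::Browser.new
browser.goto 'https://www.the-page-i-wanna-crawl.com'

# Hand the rendered DOM (not the raw source) to Nokogiri
doc = Nokogiri::HTML.parse(browser.html)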
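
Since the content reportedly shows up about two seconds after the loading icon, browser.html may still contain only the placeholder if it is read too early. One way to handle that is to poll until the content exists before reading the HTML. The helper below is a minimal, library-independent sketch; poll_until is a made-up name, the timeout and interval values are arbitrary, and Watir's own wait helpers may well be a better fit.

```ruby
# Generic polling helper (hypothetical name, not part of Watir or Nokogiri).
# Calls the block every `interval` seconds until it returns a truthy value,
# and raises once `timeout` seconds have elapsed without success.
def poll_until(timeout: 10, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# With the Watir browser from the script, this might be called
# before browser.html is read (assuming the content includes an h2):
#   poll_until { browser.h2.exists? }
```
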

Note that Watir might open a visible browser window while it runs.
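
Alternatively, as suggested in the comments, you could find the AJAX URL the page calls (the browser's network tab usually shows it) and request that directly, skipping the browser altogether. The following is only a sketch: the endpoint URL, the JSON shape with a "blogs" array, and the method names are all assumptions, since I don't know what the real page returns.

```ruby
require 'json'
require 'open-uri'

# Hypothetical endpoint; replace with the URL seen in the network tab.
AJAX_URL = "https://www.the-page-i-wanna-crawl.com/api/blogs.json"

# Pull the titles out of a JSON payload shaped like
# {"blogs": [{"title": "..."}, ...]} (an assumed shape).
def extract_titles(json_text)
  data = JSON.parse(json_text)
  data.fetch("blogs", []).map { |blog| blog["title"] }
end

# Download the payload and extract the titles.
def fetch_titles(url = AJAX_URL)
  extract_titles(URI.open(url).read)
end

# Offline check with a sample payload in the assumed shape:
sample = '{"blogs": [{"title": "First post"}, {"title": "Second post"}]}'
puts extract_titles(sample).inspect
```

If the endpoint returns an HTML fragment rather than JSON, the response body could be passed to Nokogiri::HTML exactly as in the question.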

akuhn