HTML is read before fully loaded using open-uri and nokogiri

Question

I'm using open-uri and nokogiri with Ruby to do some simple web crawling. The problem is that sometimes the HTML seems to be read before the page is fully loaded. In those cases, I can't fetch any content other than the loading icon and the nav bar. What is the best way to tell open-uri or nokogiri to wait until the page is fully loaded?

Currently my script looks like:

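require 'nokogiri'
require 'open-uri'
require 'openssl'

url = "https://www.the-page-i-wanna-crawl.com"
# URI.open replaces the bare open(url), which no longer accepts URLs as of Ruby 3.0.
doc = Nokogiri::HTML(URI.open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE))
puts doc.at_css("h2").text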

[example.com](http://example.com) source does not have any <h2> tag. — rputikar, Dec 09 '12 at 17:15
It's just a placeholder for my question. Sorry to be misleading. — Chelsea White, Dec 09 '12 at 17:17
Define "fully loaded", what about images, external scripts, ajax content, etc.? — Dave Newton, Dec 09 '12 at 17:21
I mean the main part of the page(a list of blogs) w/o external scripts and ajax content. — Chelsea White, Dec 09 '12 at 17:25
Are you sure what you think is happening is what's happening? I.e., did you check with curl or similar? I've not seen the behavior you describe. — Dave Newton, Dec 09 '12 at 17:48
Yes, I've checked with curl and the body part of the page only shows the loading icon. When I open the page with a browser, I can also see the loading icon, and after about 2 seconds, the content of the body part appears. — Chelsea White, Dec 09 '12 at 17:56
That would suggest that the content is being loaded via AJAX or some other JS method and that the raw HTML source (which is all curl/nokogiri can see) doesn't contain what you want. In which case you'll need to pick another scraper that is JS/ajax aware. — Philip Hallstrom, Dec 09 '12 at 18:53
Or, find out the AJAX URL for the content you want and request that directly. — Mark Thomas, Dec 09 '12 at 23:01
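
The last suggestion can be sketched roughly as follows, using only Ruby's standard library. Everything specific here is an assumption for illustration: the endpoint path `/api/posts.json`, the payload shape `{"posts": [{"title": ...}]}`, and the helper names `titles_from` and `fetch_titles`. The real URL and format have to be read off the browser's developer tools (Network tab) while the page loads.

```ruby
require 'json'
require 'open-uri'

# Hypothetical endpoint; replace it with the request the page actually makes.
POSTS_URL = 'https://www.the-page-i-wanna-crawl.com/api/posts.json'

# Extract the titles from a JSON payload shaped like
# {"posts": [{"title": "..."}, ...]} (an assumed shape).
def titles_from(json_text)
  JSON.parse(json_text).fetch('posts', []).map { |post| post['title'] }
end

# Download the payload and return the titles. Not called below,
# because the endpoint is only a placeholder.
def fetch_titles(url = POSTS_URL)
  titles_from(URI.open(url).read)
end

# Offline demonstration with a sample payload.
sample = '{"posts": [{"title": "First post"}, {"title": "Second post"}]}'
puts titles_from(sample)
```

If the endpoint returns an HTML fragment rather than JSON, the response body could be passed to Nokogiri in the same way as a full page.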

score 14 · Accepted Answer · answered Dec 09 '12 at 22:29

What you describe is not possible. The result of open is only passed to Nokogiri::HTML after the open method has returned the complete response.
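
To illustrate that point, here is a small sketch using only Ruby's standard library. It starts a throwaway local server that sends its response in two pieces with a pause in between, and checks that `URI.open` still hands back the whole body. The helper names `serve_slowly` and `fetch_from_slow_server` are made up for this example.

```ruby
require 'open-uri'
require 'socket'

# Answer a single HTTP request, sending the body in two chunks
# with a pause in between to mimic a slow server.
def serve_slowly(server, first, second, delay)
  Thread.new do
    client = server.accept
    # Consume the request line and headers up to the blank line.
    loop do
      line = client.gets
      break if line.nil? || line == "\r\n"
    end
    client.write("HTTP/1.1 200 OK\r\n" \
                 "Content-Type: text/html; charset=utf-8\r\n" \
                 "Content-Length: #{first.bytesize + second.bytesize}\r\n" \
                 "Connection: close\r\n\r\n")
    client.write(first)
    sleep delay
    client.write(second)
    client.close
  end
end

# Fetch a page from the slow local server with open-uri.
def fetch_from_slow_server(first, second, delay: 0.2)
  server = TCPServer.new('127.0.0.1', 0) # port 0 lets the OS pick a free port
  port = server.addr[1]
  worker = serve_slowly(server, first, second, delay)
  html = URI.open("http://127.0.0.1:#{port}/").read
  worker.join
  html
ensure
  worker&.kill
  server&.close
end

html = fetch_from_slow_server('<html><body><h2>', 'Blog list</h2></body></html>')
puts html
```

The body arrives whole even though the server paused halfway through, so a partially transferred page is not the cause. If the `<h2>` is missing, it was never in the HTML the server sent.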

I suspect that the page itself uses AJAX to load its content, as has been suggested in the comments. In that case, you can use Watir to fetch the page through a real browser:

require 'nokogiri'
require 'watir'

browser = Watir::Browser.new
browser.goto 'https://www.the-page-i-wanna-crawl.com'

doc = Nokogiri::HTML.parse(browser.html)

This might open a browser window though.
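
Even with a real browser, the AJAX content may take a moment to appear after the page has been opened. One way to handle that is to poll until the element you need shows up. The helper below is a plain-Ruby sketch of that idea; `wait_for` is a made-up name and not part of Watir or Nokogiri. Watir ships its own waiting helpers, which are probably the better choice in practice; check its documentation for the exact calls.

```ruby
# Call the block repeatedly until it returns a truthy value, or give up
# after `timeout` seconds. Returns the block's value, or nil on timeout.
def wait_for(timeout: 5, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    return nil if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
end

# Offline demonstration: the "content" only becomes available on the third try.
attempts = 0
html = wait_for(timeout: 2, interval: 0.01) do
  attempts += 1
  '<h2>Blog list</h2>' if attempts >= 3
end
puts html
```

With the browser object from the snippet above, the block could be something like `wait_for(timeout: 10) { browser.html if browser.h2.exists? }`, assuming the content you are after is an `<h2>` as in the question.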

akuhn

27,477
2
76
91

It's pretty handy. Thanks! – Chelsea White Dec 10 '12 at 13:46

Is there a way to avoid browser opening? – lcguida Oct 14 '15 at 06:54

@lcguida browser = Watir::Browser.new :chrome, headless: true – Seph Cordovano Apr 17 '18 at 01:18
