
I'm trying to do some web scraping with node.js. Using jsdom, it is easy to load up the DOM and inject JavaScript into it. I want to go one step further: run all of the JavaScript the page itself links to, and then inspect the resulting DOM, including the visual properties (height, width, etc.) of its elements.

Thus far, I get NaN when I try to inspect the dimensions of DOM elements with jsdom.
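For reference, here is a rough sketch of the kind of thing I'm attempting (the URL and the selector are just placeholders):

    const { JSDOM } = require("jsdom");

    // Load the page and let jsdom fetch and execute its external scripts.
    JSDOM.fromURL("http://example.com/", {
      runScripts: "dangerously",  // run the page's own <script> tags
      resources: "usable"         // fetch linked scripts/resources
    }).then(dom => {
      const el = dom.window.document.querySelector("#content"); // placeholder selector
      // These always come back as 0 / NaN for me:
      console.log(el.getBoundingClientRect().width);
      console.log(parseInt(dom.window.getComputedStyle(el).height, 10));
    });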

Is this possible?

It strikes me that there are two distinct challenges:

  1. Running all the JS on the web page
  2. Getting Node to simulate the window/screen rendering in addition to just the DOM

Another way to ask the question: is it possible to use node.js as a completely headless browser that you can script?

If this isn't possible, does anyone have suggestions for what library I can use to do this? I'm relatively language agnostic.

Ted Benson

2 Answers


Take a look at PhantomJS. Incredibly simple to use.

http://www.phantomjs.org/

PhantomJS is a command-line tool that packs and embeds WebKit. It acts like any other WebKit-based web browser, except that nothing is displayed on screen (hence the term "headless"). In addition, PhantomJS can be controlled and scripted through its JavaScript API.
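For example, a minimal script along these lines (the URL and selector are placeholders) loads a page, lets its JavaScript run, and then reads real layout dimensions from inside the page:

    // measure.js -- run with:  phantomjs measure.js
    var page = require('webpage').create();

    page.open('http://example.com/', function (status) {   // placeholder URL
      if (status !== 'success') {
        console.log('Failed to load the page');
        phantom.exit(1);
        return;
      }
      // page.evaluate runs inside the page, after its scripts have executed,
      // so real layout information (width, height, etc.) is available.
      var size = page.evaluate(function () {
        var el = document.querySelector('#content');        // placeholder selector
        var rect = el.getBoundingClientRect();
        return { width: rect.width, height: rect.height };
      });
      console.log(JSON.stringify(size));
      phantom.exit();
    });

Because WebKit actually lays the page out, properties like getBoundingClientRect(), offsetWidth and offsetHeight return real values here, unlike in jsdom.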

Dal Hundal

You can use:

  • HtmlUnit (Java, Jython)
  • PyQt's QtWebKit or PyGTK + WebKit (Python)
  • WWW::Mechanize::Firefox to drive Firefox (Perl)
  • Win32::IEAutomation to drive Internet Explorer (Perl)

All of these solutions can execute JavaScript as well.

You will find plenty of sample code by searching http://stackoverflow.com.

Gilles Quénot