
I'm trying to do some web scraping with node.js. Using jsdom, it is easy to load up the DOM and inject JavaScript into it. I want to go one step further: run all of the JavaScript the page itself links to, and then inspect the resulting DOM, including the visual properties (height, width, etc.) of its elements.

Thus far, I get NaN when I try to inspect the dimensions of DOM elements with jsdom.
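For reference, here is a rough sketch of the kind of thing I'm attempting (the URL and the selector are just placeholders):

    const { JSDOM } = require("jsdom");

    // Load the page and let jsdom fetch and execute its external scripts.
    JSDOM.fromURL("http://example.com/", {
      runScripts: "dangerously",  // run the page's own <script> tags
      resources: "usable"         // fetch linked scripts/resources
    }).then(dom => {
      const el = dom.window.document.querySelector("#content"); // placeholder selector
      // These always come back as 0 / NaN for me:
      console.log(el.getBoundingClientRect().width);
      console.log(parseInt(dom.window.getComputedStyle(el).height, 10));
    });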

Is this possible?

It strikes me that there are two distinct challenges:

  1. Running all the JS on the web page
  2. Getting Node to simulate the window/screen rendering in addition to just the DOM

Another way to ask the question: is it possible to use node.js as a completely headless browser that you can script?

If this isn't possible, does anyone have suggestions for what library I can use to do this? I'm relatively language agnostic.

Ted Benson

2 Answers


Take a look at PhantomJS. Incredibly simple to use.

http://www.phantomjs.org/

PhantomJS is a command-line tool that packs and embeds WebKit. It acts like any other WebKit-based web browser, except that nothing is displayed on screen (hence the term "headless"). In addition, PhantomJS can be controlled and scripted through its JavaScript API.
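For example, a minimal script along these lines (the URL and selector are placeholders) loads a page, lets its JavaScript run, and then reads real layout dimensions from inside the page:

    // measure.js -- run with:  phantomjs measure.js
    var page = require('webpage').create();

    page.open('http://example.com/', function (status) {   // placeholder URL
      if (status !== 'success') {
        console.log('Failed to load the page');
        phantom.exit(1);
        return;
      }
      // page.evaluate runs inside the page, after its scripts have executed,
      // so real layout information (width, height, etc.) is available.
      var size = page.evaluate(function () {
        var el = document.querySelector('#content');        // placeholder selector
        var rect = el.getBoundingClientRect();
        return { width: rect.width, height: rect.height };
      });
      console.log(JSON.stringify(size));
      phantom.exit();
    });

Because WebKit actually lays the page out, properties like getBoundingClientRect(), offsetWidth and offsetHeight return real values here, unlike in jsdom.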

Dal Hundal

You can use:

  • HtmlUnit (Java, Jython)
  • PyQt's QtWebKit or PyGTK + WebKit (Python)
  • WWW::Mechanize::Firefox to drive Firefox (Perl)
  • Win32::IEAutomation to drive Internet Explorer (Perl)

All of these solutions can execute JavaScript as well.

You will find plenty of sample code by searching http://stackoverflow.com.

Gilles Quénot