
I have a number of web pages, obtained using curl, that I am attempting to parse information from. Each page uses jQuery to transform its content once the document has loaded in the browser (via a $(document).ready handler), mostly setting the classes/ids of divs. The information is much easier to parse after these JavaScript functions have run.

What are my options for (preferably from the command line) executing the Javascript content of the pages and dumping the transformed HTML?

mmccomb

1 Answer


To scrape dynamic web pages, don't use static download tools like curl.

Instead, use a headless web browser that you can control from your programming language. The most popular tool for this is Selenium:

http://code.google.com/p/selenium/

With Selenium you can export the modified DOM tree out of the browser as HTML.
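
Here is a minimal sketch using the Selenium Python bindings (assuming ChromeDriver is installed and on your PATH; the URL is a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    # get() blocks until the page's load event fires, so by the time it
    # returns, the $(document).ready handlers have already run
    driver.get("https://example.com/page-with-jquery")  # placeholder URL
    # page_source dumps the current (transformed) DOM as HTML
    print(driver.page_source)
finally:
    driver.quit()
```

Redirect stdout to a file to capture the transformed HTML from the command line, e.g. `python dump.py > page.html`.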

An example use case:

https://stackoverflow.com/a/10053589/315168

Mikko Ohtamaa
  • Thanks Mikko, I ended up using Selenium with the Java and Chrome bindings to load each page and then dump the page source - it worked a treat! – mmccomb May 20 '12 at 12:59