I am writing a spider with Scrapy; however, I've come across some websites that are rendered with JavaScript, so urllib2.urlopen does not return the actual page content. I have found that I can open the browser with webbrowser.open_new(url), but I did not find a way to get the source code of the page with webbrowser. Is there a way to do this with webbrowser, or is there another solution without webbrowser for dealing with JS sites?
- A webbrowser does not store the markup of a page; it holds a DOM. – Bergi Jan 11 '13 at 03:07
4 Answers
You can use a scraper backed by a WebKit engine; several are available out there.
One of them is dryscrape.
Example:
import dryscrape

search_term = 'dryscrape'

# set up a web scraping session
sess = dryscrape.Session(base_url='http://google.com')

# we don't need images
sess.set_attribute('auto_load_images', False)

# visit homepage and search for a term
sess.visit('/')
q = sess.at_xpath('//*[@name="q"]')
q.set(search_term)
q.form().submit()

# extract all links
for link in sess.xpath('//a[@href]'):
    print link['href']

# save a screenshot of the web page
sess.render('google.png')
print "Screenshot written to 'google.png'"
See more info at:
https://github.com/niklasb/dryscrape
https://dryscrape.readthedocs.org/en/latest/index.html

- There's also Ghost (http://jeanphix.me/Ghost.py/), another headless WebKit Python implementation. I haven't tried both, so I can't say which is better. – Anton I. Sipos Jan 12 '13 at 04:19
- Raslan: thanks for your suggestion. I am working on Windows, and when I tried to install dryscrape it told me the installation was a success; however, it fails when I try to import dryscrape at runtime, saying 'from cssselect import GenericTranslator ImportError: No module named cssselect'. – user806135 Jan 17 '13 at 04:44
- The installation guide for dryscrape has the command pip install -r requirements.txt, where the file requirements.txt contains the list of packages to be installed. One of them is cssselect. Follow the installation guide again. – Sharuzzaman Ahmat Raslan Jan 17 '13 at 04:49
- When I try to run 'pip install -r requirements.txt', this error appears: make: *** No targets specified and no makefile found. Stop. error: src/webkit_server: No such file or directory – user806135 Jan 19 '13 at 06:31
- And when I try to install webkit_server, the same error appears: make: *** No targets specified and no makefile found. Stop. error: src/webkit_server: No such file or directory – user806135 Jan 19 '13 at 06:32
If you need a full JS engine, there are a number of ways you can drive WebKit from Python. Until recently, this sort of thing was done with Selenium, which drives an entire browser.
More recently there are newer and simpler ways to run a WebKit engine (which includes the V8 JavaScript engine) from Python. See this SO question: Headless Browser for Python (Javascript support REQUIRED!)
It references this blog post as an example: Scraping Javascript Webpages with Webkit. It looks to do more or less just what you need.
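As a rough sketch of the Selenium route (a minimal example, assuming the selenium package is installed and Firefox is available locally; http://example.com is just a placeholder URL):

from selenium import webdriver

# launch a real browser; it executes the page's JavaScript for us
driver = webdriver.Firefox()
driver.get('http://example.com')  # placeholder URL

# page_source holds the DOM serialized after JavaScript has run
html = driver.page_source
print html

driver.quit()

The trade-off is that a whole browser starts up, which is slow, but you get complete JavaScript support for free.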

I've been trying to find an answer to the same problem for a few days now.
I suggest you try the Qt framework with WebKit. There are two Python bindings: one is PyQt and the other is PySide. You can use them directly if you want to create something more complex or if you want to have 100% control over your code.
For trivial stuff like executing JavaScript in a browser environment you can use Ghost.py. It has some sort of documentation and some problems when using it from the command line, but otherwise it's just great.
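To illustrate the PyQt route, here is a minimal sketch of the usual pattern (assuming PyQt4 with the QtWebKit module; the Render class name and the URL are placeholders of mine): load the page in a headless QWebPage, wait for loadFinished, then read back the rendered HTML.

import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    # load a URL, let WebKit run its JavaScript, then capture the HTML
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until _finished() calls quit()

    def _finished(self, result):
        self.html = unicode(self.mainFrame().toHtml())
        self.app.quit()

r = Render('http://example.com')  # placeholder URL
print r.html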

If you need to process JavaScript, you'll need to embed a JavaScript engine. This makes your spider much more complex, mainly because JavaScript almost always modifies the DOM based on time or on actions taken by the user, which makes it extremely challenging to process JS in a crawler. If you really need to process JavaScript in your spider, have a look at the JavaScript engine by Mozilla: https://developer.mozilla.org/en/docs/SpiderMonkey
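For what it's worth, there are Python bindings for SpiderMonkey; here is a minimal sketch, assuming the python-spidermonkey package (note that the bare engine has no DOM, no window and no document, so by itself it will not render a page for you):

import spidermonkey

# create a JavaScript runtime and an execution context inside it
rt = spidermonkey.Runtime()
cx = rt.new_context()

# evaluate a JavaScript snippet; the result comes back as a Python value
result = cx.execute("var x = 6; x * 7;")
print result  # prints 42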
