
I am building a spider and using Beautiful Soup to parse the content of a particular URL. Some sites use JavaScript to show dynamic content, which only appears once some event (a click or a timer) fires. Beautiful Soup parses only the static content that exists before the JavaScript has run. I want the content as it is after the JavaScript has run. Is there any way to do this?

I can think of one way: take the URL, open it in a browser so that its JavaScript runs, and then pass the resulting page to Beautiful Soup, which can then see the dynamic content the JavaScript produced. However, if I am crawling millions of links, this solution is not practical. Is there a built-in module that can generate the dynamic content of an HTML page beforehand?

Nisarg

1 Answer


Your best bet for accurately parsing JavaScript-enhanced content from web pages is to load the page via a browser engine. Luckily there are ways to automate this in Python.

The method I've had the most success with is the pywebkitgtk project, which lets you programmatically create and control instances of the WebKit browser engine from within a Python application. I also use the jswebkit module to simplify execution of JavaScript in the page context.

Another option is PyQt4's QtWebKit module, which I've only used for experimentation.

Here is a working example of using pywebkitgtk and jswebkit together to extract data from a Webkit-rendered page. In a production environment you'll want to run several of these processors in parallel, each rendering to its own instance of the X virtual framebuffer (Xvfb).

import pygtk
pygtk.require('2.0')  # select PyGTK 2.x; this has to happen before gtk is imported

import gtk
import jswebkit
import lxml.html
import webkit

def load_finished(view, frame):
    # called when a frame finishes loading; ignore sub-frames such as iframes
    if frame != view.get_main_frame():
        return
    # wrap the frame's JavaScript context so scripts can be evaluated in the page
    ctx = jswebkit.JSContext(frame.get_global_context())
    res = ctx.EvaluateScript('window.location.href')
    print res
    # serialize the DOM as it stands after the page's scripts have run
    res = ctx.EvaluateScript('document.body.innerHTML')
    tree = lxml.html.fromstring(res)
    print tree.xpath('//input[@type="submit"]')

# initialization
gtk.gdk.threads_init()

# create the webview and hook up callbacks to signals
view = webkit.WebView()
view.set_size_request(1024, 768)
view.connect('load-finished', load_finished)

# configure the webview
props = view.get_settings()
props.set_property('enable-java-applet', False)
props.set_property('enable-plugins', False)
props.set_property('enable-page-cache', False)

# create a window to host the webview
win = gtk.Window()
win.add(view)
win.show_all()

# open google front page
view.open('http://www.google.com')

# spin, processing gtk events
while True:
    try:
        while gtk.events_pending():
            gtk.main_iteration(False)
    except KeyboardInterrupt:
        break

Example output:

http://www.google.com/
[<InputElement 2a64a78 name='btnG' type='submit'>, <InputElement 2a64bb0 name='btnG' type='submit'>, <InputElement 2a64ae0 name='btnI' type='submit'>]
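For the production setup mentioned earlier (several processors in parallel, each with its own Xvfb), the launcher could look roughly like the sketch below. It is only an illustration under assumptions: `render_worker.py` is a hypothetical script containing the rendering code above, the helper names are made up, and display numbers starting at 99 are an arbitrary choice. The only external facts relied on are the usual `Xvfb :N -screen 0 WxHxD` command line and the `DISPLAY` environment variable.

```python
import os
import subprocess
import sys


def xvfb_command(display_num, width=1024, height=768, depth=24):
    """Build the argv for one Xvfb server listening on display :<display_num>."""
    screen = "%dx%dx%d" % (width, height, depth)
    return ["Xvfb", ":%d" % display_num, "-screen", "0", screen]


def worker_env(display_num, base_env=None):
    """Copy the environment and point DISPLAY at the worker's own Xvfb."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = ":%d" % display_num
    return env


def launch_workers(count, first_display=99, script="render_worker.py"):
    """Start `count` Xvfb servers and one renderer process per server.

    `script` is a hypothetical file name. A real launcher should also wait
    until each Xvfb server accepts connections before starting its worker.
    """
    procs = []
    for offset in range(count):
        display_num = first_display + offset
        xvfb = subprocess.Popen(xvfb_command(display_num))
        worker = subprocess.Popen([sys.executable, script],
                                  env=worker_env(display_num))
        procs.append((xvfb, worker))
    return procs
```

Calling `launch_workers(4)` would start four Xvfb servers on displays :99 to :102 and one renderer per display. The caller is then responsible for waiting on the workers and terminating the Xvfb processes.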
samplebias
  • Thanks, samplebias. The thing is, it gives me an error saying "could not open display". I have tried everything, such as setting the DISPLAY variable or using the -c option with python, but I still get the same error. Is there any way around this? – Nisarg Apr 25 '11 at 18:34
  • The problem stems from gtk/webkit being unable to connect to your X display to show the browser window. If you are ssh-ing into a server, you need to be using X and enable X11 forwarding on your session, e.g. `ssh -Y [remote host]`. This should set the shell $DISPLAY variable, which you can verify with `echo $DISPLAY` – samplebias Apr 25 '11 at 18:40
  • Sure, here's a link to the [jswebkit homepage](http://packages.debian.org/source/sid/python-jswebkit) and a direct link to [download the jswebkit 0.0.3 source code](http://ftp.de.debian.org/debian/pool/main/p/python-jswebkit/python-jswebkit_0.0.3.orig.tar.gz). I've only used this code on Ubuntu 10.04+; I have no experience running under Centos 5.x. – samplebias Apr 26 '11 at 00:20
  • Thanks, man. I am working on that and will let you know once I get it done. I am just curious about the jswebkit package. It has .pyx and .pyi files. For now, I am planning to put all of these files in site-packages. The question is, do I need to install anything related to CPython? – Nisarg Apr 27 '11 at 00:03
  • I may not be able to help with build info in detail, as I've personally never built the Webkit/JavascriptCore stack on Centos. I'm pretty sure jswebkit will require a recent version of Cython (like 0.14) to build. Once you have Cython installed you should be able to build an RPM for jswebkit with `python setup.py bdist_rpm`. – samplebias Apr 27 '11 at 00:26
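
Regarding the "could not open display" error discussed in the comments above: a small pre-flight check of `$DISPLAY` before creating the webview can make the failure easier to diagnose. This is a minimal sketch, not part of the original answer. The function name `display_problem` is made up, and the check only looks at the environment variable; it does not verify that an X server is actually reachable.

```python
import os


def display_problem(env=None):
    """Return a description of what looks wrong with DISPLAY, or None."""
    env = os.environ if env is None else env
    display = env.get("DISPLAY", "")
    if not display:
        return "DISPLAY is not set; try `ssh -Y` or start Xvfb and export DISPLAY"
    if ":" not in display:
        # X display names have the form [host]:display[.screen]
        return "DISPLAY=%r does not look like [host]:display" % display
    return None


problem = display_problem()
if problem:
    print("warning: " + problem)
```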