Your best bet for accurately parsing JavaScript-enhanced content from web pages is to load the page in a real browser engine. Luckily, there are ways to automate this in Python.
The method I've had the most success with is the pywebkitgtk project, which lets you programmatically create and control instances of the WebKit browser engine from within a Python application. I also use the jswebkit module to simplify executing JavaScript in the page context.
Another option is PyQt4's QtWebKit module, which I've only used for experimentation.
Here is a working example of using pywebkitgtk and jswebkit together to extract data from a WebKit-rendered page. In a production environment, you'll want to run several of these processors in parallel, each rendering to its own instance of the X virtual framebuffer (Xvfb).
import os
import gtk
import jswebkit
import lxml.html
import pygtk
import webkit

def load_finished(view, frame):
    # called when the document finishes loading
    if frame != view.get_main_frame():
        return
    ctx = jswebkit.JSContext(frame.get_global_context())
    res = ctx.EvaluateScript('window.location.href')
    print res
    res = ctx.EvaluateScript('document.body.innerHTML')
    tree = lxml.html.fromstring(res)
    print tree.xpath('//input[@type="submit"]')

# initialization
pygtk.require20()
gtk.gdk.threads_init()

# create the webview and hook up callbacks to signals
view = webkit.WebView()
view.set_size_request(1024, 768)
view.connect('load-finished', load_finished)

# configure the webview
props = view.get_settings()
props.set_property('enable-java-applet', False)
props.set_property('enable-plugins', False)
props.set_property('enable-page-cache', False)

# create a window to host the webview
win = gtk.Window()
win.add(view)
win.show_all()

# open google front page
view.open('http://www.google.com')

# spin, processing gtk events
while True:
    try:
        while gtk.events_pending():
            gtk.main_iteration(False)
    except KeyboardInterrupt:
        break
Example output:
http://www.google.com/
[<InputElement 2a64a78 name='btnG' type='submit'>, <InputElement 2a64bb0 name='btnG' type='submit'>, <InputElement 2a64ae0 name='btnI' type='submit'>]
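
For the parallel setup mentioned earlier, here is a rough sketch of how one Xvfb server per worker might be wired up using only the standard library. It is not something I've run in production: the helper names (`xvfb_command`, `worker_env`, `launch_workers`), the starting display number 99, and the idea of keeping the rendering code in a separate script file are all assumptions made for illustration.

```python
import os
import subprocess
import sys

def xvfb_command(display, width=1024, height=768, depth=24):
    """Build the argv for one Xvfb server on the given display number."""
    return ['Xvfb', ':%d' % display, '-screen', '0',
            '%dx%dx%d' % (width, height, depth)]

def worker_env(display, base_env=None):
    """Copy the environment and point DISPLAY at the worker's own Xvfb."""
    env = dict(os.environ if base_env is None else base_env)
    env['DISPLAY'] = ':%d' % display
    return env

def launch_workers(script, count, first_display=99):
    """Start `count` Xvfb servers and one worker process per server.

    `script` is a hypothetical path to a file holding the rendering code.
    Returns a list of (xvfb_process, worker_process) pairs.
    """
    pairs = []
    for display in range(first_display, first_display + count):
        xvfb = subprocess.Popen(xvfb_command(display))
        worker = subprocess.Popen([sys.executable, script],
                                  env=worker_env(display))
        pairs.append((xvfb, worker))
    return pairs
```

Calling something like `launch_workers('render_worker.py', 4)` (the script name is made up) would start displays :99 through :102. Note that this sketch starts each worker immediately; a real setup would probably need to wait until the Xvfb server is accepting connections, and to terminate the Xvfb processes once the workers exit.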