28

I want to use Mechanize to simulate browsing to a web page with active JavaScript, including DOM Events and AJAX, and so far I've found no way to do that.

I looked at some Python client browsers that are said to support JavaScript, such as Spynner and Zope, and none of them really works for me. Spynner crashes PyQt all the time, and Zope doesn't seem to support JavaScript.

Is there a way to simulate browsing with Python only (no extra processes, unlike WATIR or libraries that drive Firefox or Internet Explorer) while fully supporting JavaScript, as if actually browsing the page?

ballade4op52
Jeff Klip
  • 1
    The Zope test browser (built on mechanize) never claimed to support JavaScript; where did you read that it might? – Martijn Pieters Apr 26 '11 at 18:21
  • 1
    Could you explain the problem you're trying to solve? It could be that you may not need JavaScript enabled after all. – Jordan Apr 26 '11 at 18:33
  • Tell us what you're trying to do and we'll tell you if we can help you! – jathanism Apr 26 '11 at 19:49
  • I'm trying to simulate browsing using strictly python. I can't use anything else because I need to use some specific tweaks and hooks that I can (currently) only do in python. I'm willing to even put in effort and try and bridge Mechanize and PyV8, but I have no idea where to start... Has anyone ever done anything like that before? – Jeff Klip Apr 28 '11 at 06:51

5 Answers

24

I've played with a new alternative to Mechanize (which I love) called PhantomJS.

It is a full WebKit browser like Safari or Chrome, but it is headless and scriptable. You script it with JavaScript, not Python (as far as I know, at least).

There are some example scripts to get you started. It's a lot like using Firebug. I've only spent a few minutes using it, but I found I was quite productive right from the start.

newz2000
  • 2
    Nice tool! Why on earth do people downvote without explanation? – Antony Hatchkins Jan 31 '12 at 20:25
  • 13
    It's because 1) it's a Javascript tool when the question explicitly asks for a Python tool and 2) manipulating that tool via the JS API from Python would be a hacky PITA at best. – Cerin May 22 '12 at 20:41
  • +1 I think PhantomJS is the way to go, and JavaScript is the language of the web – Anurag Uniyal Oct 27 '12 at 01:20
  • Does PhantomJS actually run the javascript that's on the pages it loads? (As distinct from the javascript in the phantomjs script.) I think it does, but it's hard to tell for sure. – LarsH Aug 29 '14 at 19:14
  • Yes, PhantomJS runs the page just like a regular web-browser does, though without a UI. – newz2000 Sep 03 '14 at 17:05
16

From http://wwwsearch.sourceforge.net/mechanize/faq.html#general

If you come across this in a page you want to automate, you have four options. Here they are, roughly in order of simplicity.

1. Figure out what the JavaScript is doing and emulate it in your Python code: for example, by manually adding cookies to your CookieJar instance, calling methods on HTMLForms, calling urlopen, etc. See above re forms.

2. Use Java’s HtmlUnit or HttpUnit from Jython, since they know some JavaScript.

3. Instead of using mechanize, automate a browser instead. For example use MS Internet Explorer via its COM automation interfaces, using the Python for Windows extensions, aka pywin32, aka win32all (e.g. simple function, pamie; pywin32 chapter from the O’Reilly book) or ctypes (example). This kind of thing may also come in useful on Windows for cases where the automation API is lacking. For Firefox, there is PyXPCOM.

4. Get ambitious and automatically delegate the work to an appropriate interpreter (Mozilla’s JavaScript interpreter, for instance). This is what HtmlUnit and httpunit do. I did a spike along these lines some years ago, but I think it would (still) be quite a lot of work to do well.
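
The first option above (emulating the JavaScript by hand) might look roughly like the following sketch. It uses only the Python 3 standard library rather than mechanize, and it only prepares the request without sending it. The cookie name `js_token`, the URL `http://example.com/api/items`, and the `page` field are made-up placeholders; for a real page you would have to read the actual values from its scripts or its network traffic.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_cookie(name, value, domain, path="/"):
    """Build a session cookie by hand, as a page script might via document.cookie."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def build_ajax_request(url, fields, jar):
    """Prepare the POST that the page's JavaScript would otherwise have sent."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    request = urllib.request.Request(
        url,
        data=data,
        # Many sites check this header to recognise AJAX calls (an assumption here).
        headers={"X-Requested-With": "XMLHttpRequest"},
    )
    # Attach any cookies from the jar that match the request's host and path.
    jar.add_cookie_header(request)
    return request


jar = http.cookiejar.CookieJar()
jar.set_cookie(make_cookie("js_token", "abc123", "example.com"))
request = build_ajax_request("http://example.com/api/items", {"page": "2"}, jar)

# To actually send it, something like this should work:
# opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
# response = opener.open(request)
```

This only helps when the page's JavaScript does something simple and predictable; anything that depends on the DOM or on computed values still needs a real engine.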

Community
Jordan
6

Basically, if you want something that deals with JavaScript, you need a real JavaScript engine, and that invariably means automating a real browser (I'm including headless ones in this).

Java’s HtmlUnit doesn't do a very good job, as it doesn't use a JavaScript engine from an actual browser. PhantomJS sounds ideal (as newz2000 points out); however, I find that when manipulating pages with JavaScript it can be very difficult to debug your script if you can't actually see the page you're dealing with.

This leads to solutions such as Selenium WebDriver, which has a full Python API for automating various browsers. However, you must run a Java jar and it actually launches the browser, so it is not the pure-Python solution you're after (but I think this is as close as you can get).
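
To illustrate why a JavaScript engine is needed at all, here is a small sketch using only the Python standard library. It parses a made-up page (the `PAGE` string and the `result` id are invented for this example) the way a non-JavaScript client would, and shows that content written by a script never appears in the static HTML.

```python
from html.parser import HTMLParser

# Hypothetical page: the div is empty in the markup and filled in by a script.
PAGE = """
<html><body>
  <div id="result"></div>
  <script>
    document.getElementById("result").textContent = "loaded by script";
  </script>
</body></html>
"""


class TextInElement(HTMLParser):
    """Collect the text found inside the element with a given id.

    Simplification: void elements such as <br> inside the target are not
    handled, because this parser counts every start tag as opening a level.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def static_text(html, target_id):
    """Return the text a non-JavaScript client sees inside the element."""
    parser = TextInElement(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


text = static_text(PAGE, "result")
print(repr(text))  # the script never ran, so the div is still empty
```

A browser (headless or not) would run the script and show the text; a plain HTTP client plus parser only ever sees the empty element.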

cerberos
  • I've used Selenium to automate Firefox via the Python API. It's a little buggy, but it generally works, and is probably the best solution I've seen. – Cerin May 22 '12 at 20:46
  • I too resorted to Selenium to automate web browsing for a project where running Javascript was required. For local development I used [chromedriver](http://code.google.com/p/chromedriver/) and for production I used [Selenium Server](http://seleniumhq.org/). The [Selenium Python binding docs](http://selenium-python.readthedocs.org/en/latest/index.html) are fairly helpful. – Dave Crumbacher Aug 03 '12 at 02:08
4

You can use Selenium with Python. You can then scrape JavaScript-generated content as well as manipulate the page with additional JavaScript (as well as Python).

# In your virtualenv: pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch Firefox GUI
browser = webdriver.Firefox()

# Alternatively, you can drive PhantomJS without a GUI
# With Node.js installed: `npm install -g phantomjs`
# (older Selenium releases only; Selenium 4 removed webdriver.PhantomJS)
# browser = webdriver.PhantomJS()

# Fetch a webpage
browser.get('http://example.com')

# If you need the whole HTML document
# just like inspecting the rendered page with the console
html = browser.page_source

# Get an element, even if it was created with JS
# (recent Selenium 4 releases dropped find_element_by_css_selector)
button = browser.find_element(By.CSS_SELECTOR,
                              'div.some-class > input.the-submit-button')

# Click on something
button.click()

# Execute some JavaScript (assumes jQuery is loaded on the page)
browser.execute_script("$('html, body').animate({ scrollTop: 500 }, 50);")

You can run the code in a Python REPL and use autocomplete to discover the methods available on browser or whatever element you have selected. Or do something like print(dir(browser)) to see what is available.
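
As a small variation on `print(dir(browser))`, the sketch below lists only the public methods of an object by inspecting its class, which avoids evaluating properties on the live object. `FakeBrowser` and `public_methods` are hypothetical names invented so the example runs without Selenium installed; in practice you would pass your real `browser` object instead.

```python
import inspect


def public_methods(obj):
    """Return the sorted names of public methods defined on obj's class."""
    members = inspect.getmembers(type(obj), inspect.isfunction)
    return [name for name, _ in members if not name.startswith("_")]


class FakeBrowser:
    """Stand-in for a real driver object (an assumption for this example)."""

    def get(self, url):
        return url

    def quit(self):
        return None

    def _internal(self):
        return None


methods = public_methods(FakeBrowser())
print(methods)
```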

R891
3

An example of how to use PyV8 to run JavaScript on a DOM with Python can be found here:

https://github.com/buffer/thug

It should be fairly easy to make this run together with mechanize.

Michael