57

I need a headless browser that is fairly easy to use (I am still fairly new to Python and programming in general). It should let me navigate to a page, log in through a form that requires JavaScript, and then scrape the resulting page: searching for results matching certain criteria, ticking checkboxes, and clicking to download files. All of this requires JavaScript.

I hear a headless browser is what I want - requirements/preferences are that I be able to run it from Python, and preferably that the resultant script will be compilable by py2exe (I am writing this program for other users).

So far Windmill looks like it MIGHT be what I want, but I am not sure.

Any ideas appreciated!

jbochi
Steven Matthews
  • Sorry, as far as I know this does not exist (yet). The best you can do now is run webdriver, driven from the Python interface. You can drive HtmlUnit that way, but that is written in Java so you have a combination of Java and Python. – Keith May 17 '11 at 03:43
  • Possibly related: http://stackoverflow.com/questions/13287490/is-there-a-way-to-use-phantomjs-in-python – Danilo Bargen Jun 10 '13 at 09:35

6 Answers

30

I use WebKit as a headless browser in Python via PyQt / PySide:
http://www.riverbankcomputing.co.uk/software/pyqt/download
http://developer.qt.nokia.com/wiki/Category:LanguageBindings::PySide::Downloads

I particularly like WebKit because it is simple to set up. On Ubuntu you just run: sudo apt-get install python-qt4

Here is an example script:
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
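
In case the link goes away, here is a minimal sketch of the idea, assuming PyQt4 with its QtWebKit module is installed and a display is available (on a Linux server that usually means running under something like Xvfb). The `render` function and its parameters are names made up for this illustration, not part of PyQt. It loads a page off-screen, optionally runs a JavaScript snippet through `evaluateJavaScript()` on the main frame, and returns the rendered HTML.

```python
import sys


def render(url, script=None):
    """Load `url` in an off-screen QtWebKit page.

    Returns a tuple (html, script_result). `render` is a hypothetical helper
    written for this example; it is not part of PyQt.
    """
    # Imported here so this module can still be imported where PyQt4 is absent.
    from PyQt4.QtCore import QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage

    # Qt allows only one QApplication per process, so reuse it if it exists.
    app = QApplication.instance() or QApplication(sys.argv)
    page = QWebPage()

    # Leave the event loop as soon as the initial page load has finished.
    page.loadFinished.connect(lambda ok: app.quit())
    page.mainFrame().load(QUrl(url))
    app.exec_()  # blocks until app.quit() is called above

    frame = page.mainFrame()
    # Optionally run JavaScript inside the loaded page.
    result = frame.evaluateJavaScript(script) if script else None
    return frame.toHtml(), result


if __name__ == "__main__" and len(sys.argv) > 1:
    html, title = render(sys.argv[1], "document.title")
    print(title)
```

Note that `loadFinished` only signals the end of the initial load. A page that keeps fetching data with JavaScript afterwards would need extra waiting logic, which this sketch leaves out.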

hoju
  • 1
    There's also [PySide](http://www.pyside.org/), which is similar to PyQt except under LGPL instead of GPL. – icktoofay May 17 '11 at 04:14
  • I think headless would imply no actual browser page. You can do that with webkit, but I have found it useful to have a visible page being driven by Python. The only problem is that interacting with Javascript is not the easiest thing in the world - if I remember correctly you can't just inject stuff. But I did manage to get a Python interpreter embedded into a Qt App that could 'drive' the webkit interface, so it's definitely got some juice. You might also want to take a look at http://sikuli.org/ for more of a test oriented solution. – synthesizerpatel Jun 10 '11 at 05:10
  • 1
    This is exactly what I've done for a project I'm working on using Django to have a web interface as well as a cross-platform qt interface. this way I can have feature parity at a very low cost. – theheadofabroom Jun 15 '11 at 08:13
  • @synthesizerpatel: webkit can be run headless and you can inject javascript via frame.evaluateJavaScript() – hoju Dec 20 '11 at 08:28
  • 18
    Any future visitors may wish to check out [Ghost.py](http://jeanphix.me/Ghost.py/) which provides a nice wrapper around PyQt/PySide. – Michael Mior Apr 27 '12 at 03:27
11

The answer to this question was Spynner.

uhbif19
Steven Matthews
  • 2
    Spynner's dependency libxslt requires vcvarsall.bat from VS 2008, which creates quite an ordeal, as seen here: [link](http://stackoverflow.com/questions/3047542/building-lxml-for-python-2-7-on-windows/5122521#5122521) We need an alternative. – User Jan 11 '14 at 20:33
  • 1
    Looks like Spynner does not support `python3` ATM – MarSoft Jan 24 '18 at 03:47
  • I'm not sure if this is the best answer at the present time, but in 2011 Python 2 was very viable – Steven Matthews Jan 24 '18 at 03:49
9

I'm in the midst of writing a Python driver for Zombie.js, "a lightweight framework for testing client-side JavaScript code in a simulated environment".

I'm currently at a standstill on a resolution to a bug in Node.js (before I write more tests and more code), but feel free to keep an eye on my project as it progresses:

https://github.com/ryanpetrello/python-zombie

RyanTheDev
5

There are not many headless browsers yet that support JavaScript.

You could try Zombie.js or PhantomJS. Neither is a Python tool (both are driven with plain JavaScript), but they really can do the job.

esamatti
3

Try using PhantomJS; it has great JavaScript support. You could then run it as a subprocess of a Python script

http://docs.python.org/library/subprocess.html

that could boss it around.
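
A minimal sketch of that approach, assuming Python 3.5+ and a `phantomjs` executable on the PATH. The helper name `run_headless_script` and its parameters are made up for this example. It runs `phantomjs <script> <args...>` and returns whatever the script printed to standard output.

```python
import subprocess


def run_headless_script(script_path, args=(), binary="phantomjs", timeout=60):
    """Run `binary script_path args...` and return its standard output as text.

    Hypothetical helper for illustration. `binary` is assumed to be on the
    PATH (or to be a full path to the executable).
    """
    command = [binary, script_path] + [str(arg) for arg in args]
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output to str instead of bytes
        timeout=timeout,          # raise TimeoutExpired if the browser hangs
        check=True,               # raise CalledProcessError on non-zero exit
    )
    return completed.stdout
```

The JavaScript file would do the actual navigation and print its results (for example as JSON) for the Python side to parse. Each call starts a fresh browser process, which keeps things simple but adds startup cost on every call.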

shelman
  • I do this, but I'm looking to replace it: stopping and starting the process is quite expensive and greatly impacts performance. And once you start running phantomjs as a service, you will run into a range of other issues, such as memory leaks. – Ross Mar 13 '16 at 22:24
1

You can use HTQL in combination with IRobotSoft webscraper. Check here for examples: http://htql.net/

seagulf