
Does Python have screen scraping libraries that offer JavaScript support?

I've been using pycurl for simple HTML requests, and Java's HtmlUnit for more complicated requests requiring JavaScript support.

Ideally I would like to be able to do everything from Python, but I haven't come across any libraries that would allow me to do it. Do they exist?

Marco
  • Lots of helpful answers on similar questions here: http://stackoverflow.com/search?q=scraping+python – eozzy Feb 03 '10 at 08:21
  • Exact duplicate: http://stackoverflow.com/questions/2081586/web-scraping-with-python – S.Lott Feb 03 '10 at 11:06
  • No, not an exact duplicate: this one mentions JavaScript, which requires different tools than working with static HTML. – hoju Feb 07 '10 at 21:09

7 Answers


There are many options when dealing with static HTML, which the other responses cover. However if you need JavaScript support and want to stay in Python I recommend using webkit to render the webpage (including the JavaScript) and then examine the resulting HTML. For example:

import sys
import signal
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL in WebKit, let its JavaScript run, and capture the final HTML."""

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.html = None
        # let Ctrl-C interrupt the Qt event loop
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        self.connect(self, SIGNAL('loadFinished(bool)'), self._finished_loading)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until _finished_loading() calls quit()

    def _finished_loading(self, result):
        # the page (and its JavaScript) has finished loading; grab the rendered HTML
        self.html = self.mainFrame().toHtml()
        self.app.quit()


if __name__ == '__main__':
    try:
        url = sys.argv[1]
    except IndexError:
        print 'Usage: %s url' % sys.argv[0]
    else:
        javascript_html = Render(url).html
hoju
  • Plumo - am trying to use this code to scrape a website but am not sure what to do with the 'javascript_html' variable once it's returned. `print javascript_html` returns the error `UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 4200: ordinal not in range(128)`. please help! :) – significance Nov 15 '10 at 17:30
  • I am trying this with Python 3, but the rendered html does not have its Javascript processed. Here is the code: [link](http://pastebin.com/vzX9p7jv) – karmapolice Jun 01 '15 at 15:34
  • this was tested with Python 2, Python 3 will almost certainly require some changes – hoju Jun 03 '15 at 02:25

Beautiful Soup is still probably your best bet.

If you need "JavaScript support" for the purpose of intercepting Ajax requests, then you should use some sort of capture tool (such as YATT) to monitor what those requests are, and then emulate/parse them.
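The "emulate the Ajax request" idea can be sketched as follows. The endpoint URL and the JSON shape here are hypothetical stand-ins for whatever a capture tool actually reveals; a canned payload takes the place of the HTTP call:

```python
import json

# Suppose a capture tool showed the page populating itself from an Ajax
# endpoint such as http://example.com/api/items?page=1 (hypothetical URL)
# that returns JSON. Rather than executing the page's JavaScript, fetch
# that response directly and parse it. A canned payload stands in for the
# network request here:
ajax_response = '{"items": [{"title": "First post"}, {"title": "Second post"}]}'

data = json.loads(ajax_response)
titles = [item["title"] for item in data["items"]]
print(titles)
```

The point is that once the request is known, the JavaScript that issues it becomes irrelevant: you are scraping the data source, not the rendered page.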

If you need "JavaScript support" in order to see what the end result of a page with static JavaScript is, then my first choice would be to figure out what the JavaScript is doing on a case-by-case basis (e.g. if the JavaScript is doing something based on some XML, just parse the XML directly instead).
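For instance, if the page's JavaScript merely renders an XML feed, that feed can be parsed directly with the standard library. A rough sketch, using a made-up feed structure in place of the real URL:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML that a page's JavaScript might fetch and render
# client-side; in practice this would come from the feed URL observed
# in the page source or in the browser's network traffic.
feed = """
<feed>
  <entry><title>Scraping with Python</title></entry>
  <entry><title>JavaScript-heavy pages</title></entry>
</feed>
"""

root = ET.fromstring(feed)
titles = [entry.findtext('title') for entry in root.findall('entry')]
print(titles)
```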

If you really want "JavaScript support" (as in, you want to see what the HTML is after scripts have run on a page), then I think you will probably need to create an instance of some browser control, read the resulting HTML/DOM back from the browser control once it has finished loading, and parse it normally with Beautiful Soup. That would be my last resort, however.

Justin
  • While BeautifulSoup works beautifully with 'static' HTML markup which comes `as-is` from the server, it will fail miserably with single-page style ajaxy web apps that generate their content dynamically via Javascript and XMLHttpRequests. It will also fail on sites that rely on Javascript to maintain session state and navigation specifically in order to prevent web scraping. – ccpizza Apr 17 '13 at 21:06

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Here you go: http://scrapy.org/

lprsd

Selenium, maybe? It allows you to automate an actual browser (Firefox, IE, Safari) from Python (amongst other languages). It is meant for testing websites, but it seems it should be usable for scraping as well. (Disclaimer: I've never used it myself.)

Steven

The Webscraping library wraps the PyQt4 WebView into a simple and easy-to-use API.

Here is a simple example to download a web page rendered by WebKit and extract the title element using XPath (taken from the URL above):

from webscraping import download, xpath
D = download.Download()
# download and cache the Google Code webpage
html = D.get('http://code.google.com/p/webscraping')
# use xpath to extract the project title
print xpath.get(html, '//div[@id="pname"]/a/span')
ccpizza

You could try spidermonkey.

This Python module allows for the implementation of JavaScript classes, objects and functions in Python, as well as the evaluation and calling of JavaScript scripts and functions. It borrows heavily from Claes Jacobssen's JavaScript Perl module, which in turn is based on Mozilla's PerlConnect Perl binding.

ghostdog74

I have not found anything for this. I use a combination of BeautifulSoup and custom routines...

Art