
I tried to build a Tornado application that provides RESTful APIs to crawl web pages, and I found that CurlAsyncHTTPClient cannot fetch a fully loaded page or a JS-generated page.

Are there any solutions to this problem? Is there a library that could fetch fully loaded pages or js-generated pages and work with Tornado's non-blocking mechanism?

I would appreciate any suggestions or solutions. :)

zeck

1 Answer


An HTTP client (whether it is Tornado's async, curl, or simple client, or requests, ...) makes a request to a given resource and fetches the response, nothing more: it does not parse the response or execute anything in it.

What you want instead is to fetch the response, parse it, download all its dependencies (JS), and execute/render everything in the right order. Basically, you will need to write a browser or use one.
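To make the gap concrete, here is a minimal sketch using only the standard library. The HTML string and the `DivTextCollector` class are made up for illustration; the string stands in for the raw body any HTTP client would hand back. The `<script>` is delivered as plain text and never runs, so the `<div>` it was supposed to fill stays empty.

```python
from html.parser import HTMLParser

# Hypothetical raw response body, as an HTTP client would receive it.
RAW_RESPONSE = """
<html><body>
<div id="content"></div>
<script>document.getElementById("content").textContent = "Hello";</script>
</body></html>
"""


class DivTextCollector(HTMLParser):
    """Collects text found inside <div> elements and counts <script> tags."""

    def __init__(self):
        super().__init__()
        self.in_div = False
        self.div_text = []
        self.script_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.in_div = True
        elif tag == "script":
            self.script_count += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_div = False

    def handle_data(self, data):
        # Script source also arrives here, but only as inert text.
        if self.in_div:
            self.div_text.append(data)


collector = DivTextCollector()
collector.feed(RAW_RESPONSE)
div_text = "".join(collector.div_text).strip()

print(repr(div_text))          # the div is empty: nobody executed the script
print(collector.script_count)  # the script tag itself is present in the markup
```

Parsing alone does not help here; something has to actually execute the script, which is what a browser engine does.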

There are many headless browser implementations, mostly based on WebKit. Worth noting in Python are Ghost.py and Selenium (with PhantomJS); see "Is there a way to use PhantomJS in Python?".

The main drawback, in the context of your question, is that they do not integrate with Tornado in any way (for example, they do not use its HTTP client to make requests). So I would, personally, drop Tornado for this task.
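If Tornado has to stay, one common workaround for any blocking call is to push it onto an executor so the event loop stays free. The sketch below uses only the standard library (recent Tornado versions run on the asyncio event loop, so the same `run_in_executor` call applies there). `render_blocking` is a hypothetical stand-in that just sleeps; it is not a real renderer. Qt generally expects its application object to live in the main thread, so a real Qt-based renderer would more likely have to run in a separate process than in a worker thread.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def render_blocking(url):
    """Hypothetical stand-in for a blocking headless-browser call."""
    time.sleep(0.1)  # simulate slow page rendering
    return "<html><body>rendered %s</body></html>" % url


async def render_async(url, executor):
    # Run the blocking call in a worker thread and await its result,
    # so the event loop can keep serving other requests meanwhile.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, render_blocking, url)


async def main():
    with ThreadPoolExecutor(max_workers=2) as executor:
        # Both renders run concurrently in the pool.
        return await asyncio.gather(
            render_async("http://example.com/a", executor),
            render_async("http://example.com/b", executor),
        )


pages = asyncio.run(main())
print(len(pages))
```

This only hides the blocking behind a pool; the renderer itself still knows nothing about Tornado.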

A quick and dirty example with PySide (Qt), without Tornado:

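import sys

from PySide.QtCore import QUrl
from PySide.QtGui import QApplication
from PySide.QtWebKit import QWebPage


class DownloadPage(QWebPage):

    # Class-level slot that holds the most recently rendered HTML.
    html = ''

    def __init__(self, url, app, parent=None):
        QWebPage.__init__(self, parent)
        DownloadPage.html = ''
        # Stop the Qt event loop once the page has finished loading.
        self.loadFinished.connect(app.quit)
        self.mainFrame().load(QUrl(url))

    def save(self):
        # Store the main frame's HTML as it stands after loading.
        DownloadPage.html = self.mainFrame().toHtml()


def get_page(url):
    app = QApplication.instance()
    if not app:  # create the QApplication if it doesn't exist yet
        app = QApplication(sys.argv)
    dp = DownloadPage(url, app)
    # Capture the HTML just before the application quits.
    app.aboutToQuit.connect(dp.save)
    app.exec_()  # blocks until loadFinished triggers app.quit
    return DownloadPage.html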
kwarunek