14

I'm trying to speed up a Selenium/PhantomJS web scraper in Python by preventing the download of CSS and other resources. All I need from each page is the src and alt attributes of its img tags. I've found this code:

page.onResourceRequested = function(requestData, request) {
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};

via: How can I control PhantomJS to skip download some kind of resource?

How and where can I implement this code in Selenium driven by Python? Or is there a better way to stop CSS and other resources from downloading?

Note: I've already found how to prevent image download by editing service_args variable via:

How do I set a proxy for phantomjs/ghostdriver in python webdriver?

and

PhantomJS 1.8 with Selenium on python. How to block images?

But service_args can't help me with resources like CSS. Thanks!
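For reference, the image-blocking approach mentioned above can be sketched roughly as follows. `--load-images` and `--proxy` are PhantomJS command-line options; the helper names `build_service_args` and `make_driver` are hypothetical and only show how the list is assembled and passed in. This assumes an older Selenium release that still ships `webdriver.PhantomJS`.

```python
def build_service_args(load_images=False, proxy=None):
    """Assemble PhantomJS command-line switches (hypothetical helper)."""
    args = ['--load-images=' + ('true' if load_images else 'false')]
    if proxy:
        # proxy is assumed to be a "host:port" string
        args.append('--proxy=' + proxy)
    return args


def make_driver(service_args):
    """Start PhantomJS with the given switches (hypothetical helper)."""
    # Imported lazily so build_service_args works without Selenium installed.
    from selenium import webdriver
    return webdriver.PhantomJS('phantomjs', service_args=service_args)
```

Calling `make_driver(build_service_args())` would start PhantomJS with image loading turned off.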

YPCrumble
  • If all you want is the HTML and select elements from the page, is Selenium/PhantomJS the best option? Have you considered using [python-requests](http://docs.python-requests.org/en/latest/)? – brechin Oct 10 '13 at 13:43
  • @brechin, that's a great idea, thanks! Unfortunately I don't think python-requests can get javascript injected content. For example, see the main image on this page: https://www.everlane.com/collections/mens-luxury-tees/products/mens-v-antique. Everything in `
    ` is injected via backbone.js, and in my output from python-requests, I simply get an empty div with the `` comment... Might I be missing something?
    – YPCrumble Oct 14 '13 at 21:59
  • I'd look at the requests and just grab https://www.everlane.com/api/collections – brechin Oct 22 '13 at 23:11

3 Answers

7

A bold young soul by the name of “watsonmw” recently added functionality to Ghostdriver (which PhantomJS uses to interface with Selenium) that allows access to PhantomJS API calls that require a page object, like the onResourceRequested one you cited.

For a solution at all costs, consider building from source (which developers note “takes roughly 30 minutes ... with 4 parallel compile jobs on a modern machine”) and integrating his patch, linked above.

Then this (untested) Python code should work as a proof of concept:

from selenium import webdriver
driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute('executePhantomScript', {'script': '''
page.onResourceRequested = function(requestData, request) {
    // ...
}
''', 'args': []})

Until then, you’ll just get a Can't find variable: page exception.

Good luck! There are a lot of great alternatives, like working in a JavaScript environment, driving Gecko, using a proxy, etc.

Will McChesney
  • It seems that the patch is already in Ghostdriver 1.1.0, but when I start it (with `phantomjs /path/to/ghostdriver/1.1.0/src/main.js`) and connect to it (with `driver = webdriver.PhantomJS(port=8910)` ) I still get `Can't find variable: page`. – MaratC Nov 17 '14 at 12:17
4

Will's answer got me on track. (Thanks Will!)

The current PhantomJS (1.9.8) includes Ghostdriver 1.1.0, which already contains watsonmw's patch.

Download the latest PhantomJS, then create the following symlinks (sudo may be required):

ln -s /path/to/bin/phantomjs /usr/local/share/phantomjs
ln -s /path/to/bin/phantomjs /usr/local/bin/phantomjs
ln -s /path/to/bin/phantomjs /usr/bin/phantomjs
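Before going further, it may help to confirm from Python which phantomjs binary is actually picked up from PATH and what version it reports. The following is a minimal sketch; `parse_version` and `phantomjs_version` are hypothetical helper names, and it assumes `phantomjs --version` prints a plain version string such as 1.9.8.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Turn a string such as '1.9.8' into a comparable tuple (1, 9, 8)."""
    return tuple(int(part) for part in re.findall(r'\d+', text)[:3])


def phantomjs_version(binary='phantomjs'):
    """Return the version tuple of the binary found on PATH, or None."""
    path = shutil.which(binary)
    if path is None:
        return None
    output = subprocess.check_output([path, '--version'])
    return parse_version(output.decode('utf-8', 'replace'))
```

A result of `None` means the symlinks are not visible on PATH. A tuple of at least `(1, 9, 8)` suggests the bundled Ghostdriver should already include the patch.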

And then try this:

from selenium import webdriver
driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute('executePhantomScript', {'script': '''
    var page = this; // won't work otherwise
    page.onResourceRequested = function(requestData, request) {
        // ...
    };
''', 'args': []})
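One possible way to fill in the `// ...` body is to generate the script from Python and abort requests whose URL matches a pattern, along the lines of the filter quoted in the question. This is an unverified sketch, not a tested recipe: `build_abort_script` and `block_resources` are hypothetical names, the default pattern is an assumption (it also matches https URLs), and it relies on the same `executePhantomScript` hack shown above.

```python
import json

# Same command registration as in the answer above.
EXECUTE_PHANTOM = ('POST', '/session/$sessionId/phantom/execute')


def build_abort_script(url_pattern=r'https?://.+?\.css'):
    """Return PhantomJS-side JavaScript that aborts matching requests."""
    # json.dumps produces a valid JavaScript string literal for the pattern.
    return '''
        var page = this;
        var pattern = new RegExp(%s, 'i');
        page.onResourceRequested = function(requestData, request) {
            if (pattern.test(requestData.url)) {
                request.abort();
            }
        };
    ''' % json.dumps(url_pattern)


def block_resources(driver, url_pattern=r'https?://.+?\.css'):
    """Register the PhantomJS command and install the request filter."""
    driver.command_executor._commands['executePhantomScript'] = EXECUTE_PHANTOM
    driver.execute('executePhantomScript',
                   {'script': build_abort_script(url_pattern), 'args': []})
```

With a live driver, `block_resources(driver)` would be called once before `driver.get(...)`. Passing a broader pattern, for example one that also covers font files, is left to the caller.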
MaratC
2

The proposed solutions didn't work for me, but this one does (it uses driver.execute_script):

driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute_script('''
    this.onResourceRequested = function(request, net) {
        console.log('REQUEST ' + request.url);
    };
''')
AlexMe