
I have a Python crawler that uses PhantomJS to crawl sites, and I am trying to stop it from loading CSS on those pages. I found the following code in various sources on the internet to block CSS loading, but it is not working. Please help me fix this issue. I also tried other solutions mentioned on Stack Overflow, but they didn't work either.

from selenium import webdriver

driver = webdriver.PhantomJS()

driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')
driver.execute('executePhantomScript', {'script': '''
var page = this;
page.onResourceRequested = function(requestData, request) {
 if ((/http:\/\/.+?\.css/gi).test(requestData['https://www.whatismyip.com/']) || requestData.headers['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['https://www.whatismyip.com/']);
        request.abort();
}
''', 'args': []})

driver.get("https://www.whatismyip.com/")
ipaddress=driver.find_element_by_xpath("//div[@class='ip']").text
print ipaddress
driver.quit()
Artjom B.
Jeya Kumar
  • An alternative option could be to start up a proxy and let it filter out requests with text/css mimetype. And here is how you can specify it when initializing PhantomJS webdriver instance: http://stackoverflow.com/questions/14699718/how-do-i-set-a-proxy-for-phantomjs-ghostdriver-in-python-webdriver. – alecxe Sep 10 '15 at 12:28
  • Hi, thanks for your suggestion. I saw that link; it describes how to set the proxy, and I already have my proxy settings as follows: `service_args = ['--proxy=x.x.x.x:8080','--proxy-type=http','--web-security=false','--ignore-ssl-errors=true','--local-to-remote-url-access=true',] webdriver.PhantomJS.__init__(self,service_args=service_args,desired_capabilities=dcap)`. Could you please suggest what changes I have to make to these settings? – Jeya Kumar Sep 10 '15 at 12:59

1 Answer


You're testing the regex against requestData['https://www.whatismyip.com/'], which I'm assuming is undefined because requestData has no such key. This is fixed by using requestData.url, as per the documentation. Also, a request for a stylesheet will not carry a text/css Content-Type (that header is set on the response), so this condition can be removed.

I chose to simplify your regular expression, since some URLs are served over HTTPS or are relative and will not match http://. I use a $ anchor to test for .css at the end of the URL (the global modifier is unnecessary, since you only need a single match).
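To make the difference between the patterns concrete, here is a small sketch that translates them to Python's `re` module (JavaScript's `RegExp.test()` behaves like `re.search()` here). The sample URLs and the third pattern, which also tolerates a query string such as `?v=3`, are my own assumptions for illustration and are not part of the original code.

```python
import re

# Python translations of the JavaScript patterns discussed above.
ORIGINAL = re.compile(r"http://.+?\.css", re.IGNORECASE)   # pattern from the question
ANCHORED = re.compile(r"\.css$", re.IGNORECASE)            # simplified pattern from this answer
WITH_QUERY = re.compile(r"\.css(\?.*)?$", re.IGNORECASE)   # assumption: also allow a query string


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


# Hypothetical URLs chosen to show where the patterns disagree.
SAMPLE_URLS = [
    "http://example.com/style.css",
    "https://example.com/style.css",
    "/static/site.css",
    "https://example.com/style.css?v=3",
    "http://example.com/a.css.html",
]

if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(url,
              matches(ORIGINAL, url),
              matches(ANCHORED, url),
              matches(WITH_QUERY, url))
```

Note that the `$`-anchored pattern skips stylesheet URLs that carry a query string, so a site that versions its assets that way may need the looser variant.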

Your final .onResourceRequested callback may contain a conditional like this:

if(/\.css$/i.test(requestData.url)) {
    request.abort();
}
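For context, here is a minimal sketch of how that conditional could sit inside the complete script. It assumes a Selenium release that still ships `webdriver.PhantomJS`; the helper name `install_css_blocker` and the constant `BLOCK_CSS_SCRIPT` are hypothetical names of mine. The script is a raw string so the backslash in the regex reaches PhantomJS unchanged, and the callback is closed with `};` so the braces balance, which the script in the question does not do.

```python
# JavaScript run by GhostDriver; `this` refers to the current page.
BLOCK_CSS_SCRIPT = r"""
var page = this;
page.onResourceRequested = function (requestData, networkRequest) {
    // Abort any request whose URL ends in .css
    if (/\.css$/i.test(requestData.url)) {
        networkRequest.abort();
    }
};
"""


def install_css_blocker(driver):
    """Register the PhantomJS-specific command, then run the script.

    `driver` is expected to be a webdriver.PhantomJS instance (an assumption;
    any object exposing the same attributes works for this sketch).
    """
    driver.command_executor._commands['executePhantomScript'] = (
        'POST', '/session/$sessionId/phantom/execute')
    driver.execute('executePhantomScript',
                   {'script': BLOCK_CSS_SCRIPT, 'args': []})
```

Call `install_css_blocker(driver)` right after creating the driver and before `driver.get(...)`, so the callback is in place when the page starts requesting resources.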
Sam
  • Thanks for your effort. I modified the code as follows, but it is still not working: `driver = webdriver.PhantomJS() driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute') driver.execute('executePhantomScript', {'script': ''' var page = this; page.onResourceRequested = function(requestData, request) { if(/\.css$/i.test(requestData.url)) { request.abort(); } ''', 'args': []}) driver.get("https://www.whatismyip.com/") ipaddress=driver.find_element_by_xpath("//div[@class='ip']").text print ipaddress driver.quit()` – Jeya Kumar Sep 10 '15 at 14:29
  • Do I have to make any other changes to the code above? I am really confused about how to achieve this. – Jeya Kumar Sep 10 '15 at 14:30