
PhantomJS has the loadImages config setting,

but I want more control:

how can I tell PhantomJS to skip downloading certain kinds of resources,

such as CSS files?

=====

Good news: this feature has been added.

https://code.google.com/p/phantomjs/issues/detail?id=230

The gist:

page.onResourceRequested = function(requestData, request) {
    // Note: requestData has no top-level 'Content-Type' key (its headers
    // field is an array of {name, value} objects), so in practice only the
    // URL regex does the matching here.
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};
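The matching logic in the gist can also be factored out into a plain predicate, which makes it easy to extend the blocklist to images, fonts, and so on. A minimal sketch (the shouldAbort helper name is my own):

```javascript
// Hypothetical helper: decide whether a requested URL should be aborted.
// Matches by file extension; extend the list for other resource types.
function shouldAbort(url) {
  var blocked = [/\.css(\?.*)?$/i, /\.png(\?.*)?$/i, /\.jpe?g(\?.*)?$/i];
  return blocked.some(function (re) { return re.test(url); });
}

// Wired into a PhantomJS script it would look like:
// page.onResourceRequested = function (requestData, request) {
//   if (shouldAbort(requestData.url)) request.abort();
// };

console.log(shouldAbort('http://example.com/site.css')); // true
console.log(shouldAbort('http://example.com/app.js'));   // false
```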
atian25

4 Answers


UPDATED, Working!

As of PhantomJS 1.9, the existing answer no longer works. You must use this code instead:

var webPage = require('webpage');
var page = webPage.create();

page.onResourceRequested = function(requestData, networkRequest) {
  var match = requestData.url.match(/wordfamily.js/g);
  if (match != null) {
    console.log('Request (#' + requestData.id + '): ' + JSON.stringify(requestData));
    networkRequest.cancel(); // or .abort() 
  }
};

If you use abort() instead of cancel(), it will trigger onResourceError.

You can look at the PhantomJS docs.
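To see which requests were actually dropped, you can pair abort() with an onResourceError handler. A minimal sketch (the formatResourceError helper and the sample field values are my own; url, errorCode, and errorString are the fields PhantomJS passes to onResourceError):

```javascript
// Build a log line from the object PhantomJS passes to onResourceError.
function formatResourceError(resourceError) {
  return 'Unable to load ' + resourceError.url +
      ' (' + resourceError.errorCode + ': ' + resourceError.errorString + ')';
}

// In a PhantomJS script:
// page.onResourceError = function (resourceError) {
//   console.log(formatResourceError(resourceError));
// };

// Illustrative values only:
console.log(formatResourceError({
  url: 'http://example.com/site.css',
  errorCode: 5,
  errorString: 'Operation canceled'
}));
```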

webo80

Finally, you can try node-crawler: http://github.com/eugenehp/node-crawler

Otherwise, you can still try one of the following approaches with PhantomJS.

The easy way is to load the page, parse it, exclude the unwanted resources, and then load the result into PhantomJS.

Another way is to simply block the offending hosts in the firewall.

Optionally, you can use a proxy to block certain URL addresses and queries to them.

Additionally, you could load the page and then remove the unwanted resources, but I don't think that's the right approach here.
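The host-blocking idea can also be done in userland, without touching the firewall, by keeping a blocklist of hosts inside the PhantomJS script. A sketch (the host list and the isBlockedHost helper name are my own):

```javascript
// Hosts whose requests should never be loaded (illustrative entries).
var BLOCKED_HOSTS = ['www.google-analytics.com', 'fonts.googleapis.com'];

// Extract the host from a URL and check it against the blocklist.
function isBlockedHost(url) {
  var m = url.match(/^https?:\/\/([^\/:]+)/i);
  return m !== null && BLOCKED_HOSTS.indexOf(m[1].toLowerCase()) !== -1;
}

// In a PhantomJS script:
// page.onResourceRequested = function (requestData, request) {
//   if (isBlockedHost(requestData.url)) request.abort();
// };

console.log(isBlockedHost('http://www.google-analytics.com/ga.js')); // true
console.log(isBlockedHost('http://example.com/page.html'));          // false
```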

Eugene Hauptmann
  • I wonder why PhantomJS doesn't do this itself? Sometimes we need to load a lot of pages without CSS/images, and we can't exclude the unwanted resources by hand. – atian25 Jun 18 '12 at 04:02
  • There is such a thing as page.content; you can manipulate it with some kind of resource filtering using regex filters (css, js). Or you can simply crawl the webpage and parse only the images you want to keep. – Eugene Hauptmann Jun 18 '12 at 07:27
  • Thanks for the reply. Did you mean that there is some filter interface/API provided by PhantomJS so we can skip certain kinds of resources (not download them at all)? What's the API name? – atian25 Jun 19 '12 at 01:03
  • Sorry, that's not part of the PhantomJS API. I meant something like String.replace() or the JavaScript RegExp object. Take a look at http://www.w3schools.com/jsref/jsref_obj_regexp.asp – Eugene Hauptmann Jun 19 '12 at 07:54
  • Not what I need. I need to give PhantomJS a page URL and the specific resource types, so it downloads what I want at that URL without me having to tell it each individual resource URL to download. – atian25 Jun 19 '12 at 08:32
  • Can you send me the test page, and the test case of resources you want to load? – Eugene Hauptmann Jun 19 '12 at 10:15
  • For example, I want to write a spider to collect questions at http://stackoverflow.com/ . I just want to give PhantomJS this URL and have it download only the page, without CSS/images/JS (for reasons such as speed, etc.) – atian25 Jun 20 '12 at 03:20
  • Hi, take a look at https://github.com/eugenehp/node-crawler then; it just crawls the webpage, with no WebKit and no rendering capabilities, but it loads fast and can manipulate the DOM. – Eugene Hauptmann Jun 20 '12 at 05:09
  • Thanks, node.io / jsdom can do it too; I just wonder why PhantomJS doesn't support this feature. – atian25 Jun 21 '12 at 09:19

Use page.onResourceRequested, as in the example loadurlwithoutcss.js:

page.onResourceRequested = function(requestData, request) {
    // requestData.headers is an array of {name, value} objects, so the
    // dictionary-style lookup below is unreliable; in practice the URL
    // regex does the work.
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || 
            requestData.headers['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};
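Since requestData.headers is an array of {name, value} objects rather than a map, looking a header up by name needs a small helper. A sketch (the getHeader name is my own):

```javascript
// Look up a header by name in PhantomJS's headers array
// (an array of {name, value} objects), case-insensitively.
function getHeader(headers, name) {
  for (var i = 0; i < headers.length; i++) {
    if (headers[i].name.toLowerCase() === name.toLowerCase()) {
      return headers[i].value;
    }
  }
  return null;
}

// Usage inside onResourceRequested:
// if (getHeader(requestData.headers, 'Content-Type') === 'text/css') { ... }

console.log(getHeader([{name: 'Content-Type', value: 'text/css'}], 'content-type')); // text/css
```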
bain

No way for now (PhantomJS 1.7); it does NOT support that.

A hacky workaround, though, is to use an HTTP proxy, so you can screen out the requests you don't need.
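For example, with Squid in front of PhantomJS, an ACL can reject stylesheet and image requests. A sketch of the relevant squid.conf fragment (the regex, ACL name, and port are illustrative; PhantomJS is pointed at the proxy via its --proxy switch):

```
# squid.conf sketch: deny requests for CSS and common image types
acl static_assets urlpath_regex -i \.(css|png|jpe?g|gif)(\?.*)?$
http_access deny static_assets

# Then run: phantomjs --proxy=127.0.0.1:3128 script.js
```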

SHAWN
  • Of course this is the best solution; by the way, you should always use a proxy (Varnish or Squid) to "control" what your programs are downloading (to add queuing, caching, etc.) – Thomas Decaux Jun 25 '13 at 13:46