
PhantomJS has the loadImages config setting,

but I want more control:

how can I tell PhantomJS to skip downloading certain kinds of resources,

such as CSS files?

=====

Good news: this feature has been added.

https://code.google.com/p/phantomjs/issues/detail?id=230

The gist:

page.onResourceRequested = function(requestData, request) {
    // Note: requestData has no top-level 'Content-Type' key (its headers
    // field is an array of {name, value} objects), so in practice only the
    // URL regex does the matching here.
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};
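The matching logic in the gist can also be factored out into a plain predicate, which makes it easy to extend the blocklist to images, fonts, and so on. A minimal sketch (the shouldAbort helper name is my own):

```javascript
// Hypothetical helper: decide whether a requested URL should be aborted.
// Matches by file extension; extend the list for other resource types.
function shouldAbort(url) {
  var blocked = [/\.css(\?.*)?$/i, /\.png(\?.*)?$/i, /\.jpe?g(\?.*)?$/i];
  return blocked.some(function (re) { return re.test(url); });
}

// Wired into a PhantomJS script it would look like:
// page.onResourceRequested = function (requestData, request) {
//   if (shouldAbort(requestData.url)) request.abort();
// };

console.log(shouldAbort('http://example.com/site.css')); // true
console.log(shouldAbort('http://example.com/app.js'));   // false
```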
atian25

4 Answers


UPDATED, Working!

As of PhantomJS 1.9, the existing answer no longer works. You must use this code instead:

var webPage = require('webpage');
var page = webPage.create();

page.onResourceRequested = function(requestData, networkRequest) {
  var match = requestData.url.match(/wordfamily.js/g);
  if (match != null) {
    console.log('Request (#' + requestData.id + '): ' + JSON.stringify(requestData));
    networkRequest.cancel(); // or .abort() 
  }
};

If you use abort() instead of cancel(), it will trigger onResourceError.

You can look at the PhantomJS docs.
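To see which requests were actually dropped, you can pair abort() with an onResourceError handler. A minimal sketch (the formatResourceError helper and the sample field values are my own; url, errorCode, and errorString are the fields PhantomJS passes to onResourceError):

```javascript
// Build a log line from the object PhantomJS passes to onResourceError.
function formatResourceError(resourceError) {
  return 'Unable to load ' + resourceError.url +
      ' (' + resourceError.errorCode + ': ' + resourceError.errorString + ')';
}

// In a PhantomJS script:
// page.onResourceError = function (resourceError) {
//   console.log(formatResourceError(resourceError));
// };

// Illustrative values only:
console.log(formatResourceError({
  url: 'http://example.com/site.css',
  errorCode: 5,
  errorString: 'Operation canceled'
}));
```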

webo80

Finally, you can try node-crawler: http://github.com/eugenehp/node-crawler

Otherwise, you can still try one of the following approaches with PhantomJS.

The easy way is to load the page, parse it, exclude the unwanted resources, and then load the result into PhantomJS.

Another way is to simply block the offending hosts in the firewall.

Optionally, you can use a proxy to block certain URL addresses and queries to them.

Additionally, you could load the page and then remove the unwanted resources, but I don't think that's the right approach here.
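The host-blocking idea can also be done in userland, without touching the firewall, by keeping a blocklist of hosts inside the PhantomJS script. A sketch (the host list and the isBlockedHost helper name are my own):

```javascript
// Hosts whose requests should never be loaded (illustrative entries).
var BLOCKED_HOSTS = ['www.google-analytics.com', 'fonts.googleapis.com'];

// Extract the host from a URL and check it against the blocklist.
function isBlockedHost(url) {
  var m = url.match(/^https?:\/\/([^\/:]+)/i);
  return m !== null && BLOCKED_HOSTS.indexOf(m[1].toLowerCase()) !== -1;
}

// In a PhantomJS script:
// page.onResourceRequested = function (requestData, request) {
//   if (isBlockedHost(requestData.url)) request.abort();
// };

console.log(isBlockedHost('http://www.google-analytics.com/ga.js')); // true
console.log(isBlockedHost('http://example.com/page.html'));          // false
```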

Eugene Hauptmann
  • I wonder why PhantomJS doesn't do this itself? Sometimes we need to load a lot of pages without CSS/images, and we can't exclude the unwanted resources by hand. – atian25 Jun 18 '12 at 04:02
  • There is such a thing as page.content; you can manipulate it with some kind of resource filtering using regex filters (css, js). Or you can simply crawl the webpage and parse only the images you want to keep. – Eugene Hauptmann Jun 18 '12 at 07:27
  • Thanks for the reply. Did you mean that there is some filter interface/API provided by PhantomJS so we can skip certain kinds of resources (not download them at all)? What's the API name? – atian25 Jun 19 '12 at 01:03
  • Sorry, that's not part of the PhantomJS API. I meant something like String.replace() or the JavaScript RegExp object. Take a look at http://www.w3schools.com/jsref/jsref_obj_regexp.asp – Eugene Hauptmann Jun 19 '12 at 07:54
  • Not what I need. I need to give PhantomJS a page URL and the specific resource types, so it downloads what I want at that URL without me having to tell it each individual resource URL to download. – atian25 Jun 19 '12 at 08:32
  • Can you send me the test page, and the test case of resources you want to load? – Eugene Hauptmann Jun 19 '12 at 10:15
  • For example, I want to write a spider to collect questions at http://stackoverflow.com/ . I just want to give PhantomJS this URL and have it download only the page, without CSS/images/JS (for reasons such as speed, etc.) – atian25 Jun 20 '12 at 03:20
  • Hi, take a look at https://github.com/eugenehp/node-crawler then; it just crawls the webpage, with no WebKit and no rendering capabilities, but it loads fast and can manipulate the DOM. – Eugene Hauptmann Jun 20 '12 at 05:09
  • Thanks, node.io / jsdom can do it too; I just wonder why PhantomJS doesn't support this feature. – atian25 Jun 21 '12 at 09:19

Use page.onResourceRequested, as in the example loadurlwithoutcss.js:

page.onResourceRequested = function(requestData, request) {
    // requestData.headers is an array of {name, value} objects, so the
    // dictionary-style lookup below is unreliable; in practice the URL
    // regex does the work.
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || 
            requestData.headers['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};
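Since requestData.headers is an array of {name, value} objects rather than a map, looking a header up by name needs a small helper. A sketch (the getHeader name is my own):

```javascript
// Look up a header by name in PhantomJS's headers array
// (an array of {name, value} objects), case-insensitively.
function getHeader(headers, name) {
  for (var i = 0; i < headers.length; i++) {
    if (headers[i].name.toLowerCase() === name.toLowerCase()) {
      return headers[i].value;
    }
  }
  return null;
}

// Usage inside onResourceRequested:
// if (getHeader(requestData.headers, 'Content-Type') === 'text/css') { ... }

console.log(getHeader([{name: 'Content-Type', value: 'text/css'}], 'content-type')); // text/css
```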
bain

No way for now (PhantomJS 1.7); it does NOT support that.

A hacky workaround, though, is to use an HTTP proxy, so you can screen out the requests you don't need.
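For example, with Squid in front of PhantomJS, an ACL can reject stylesheet and image requests. A sketch of the relevant squid.conf fragment (the regex, ACL name, and port are illustrative; PhantomJS is pointed at the proxy via its --proxy switch):

```
# squid.conf sketch: deny requests for CSS and common image types
acl static_assets urlpath_regex -i \.(css|png|jpe?g|gif)(\?.*)?$
http_access deny static_assets

# Then run: phantomjs --proxy=127.0.0.1:3128 script.js
```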

SHAWN
  • Of course this is the best solution; by the way, you should always use a proxy (Varnish or Squid) to "control" what your programs are downloading (to add queuing, caching, etc.) – Thomas Decaux Jun 25 '13 at 13:46