I am loading pages with PhantomJS and each request takes about 20 seconds, so I want to speed it up.

browser.createPage((err, page) => {
    page.set('settings', {
        userAgent: random_ua.generate(),
        javascriptEnabled: true,
        loadImages: false
    });

    return page.open(url, (err, status) => {
        if (err) {
            console.log('Error:', err);
            // Stop here so onError() is not called a second time below.
            return onError();
        }

        if (status === 'success') {
            // Runs inside the page and returns the rendered HTML of <body>.
            page.evaluate(function () {
                return document.body.innerHTML.trim();
            }, (err, result) => {

                console.log('Execution time: ' + (Date.now() - time) / 1000 + 's');

                browser.exit();
                resolve(result);
            });
        } else {
            console.log('Status:', status);
            onError();
        }

    });
});

As far as I can see, it waits until the page and all of its external resources (CSS, JS, etc.) have fully loaded.

How can I resolve with the HTML as soon as it has loaded, without waiting for the external resources to load?

Mike
  • If you just want the html, why are you using phantomjs? See the comment on this: http://stackoverflow.com/a/20174298/484780 – Kevin Jantzer Feb 06 '17 at 19:13
  • @KevinJantzer Because OP probably wants the resulting HTML of the page which is shaped by javascript? – Vaviloff Feb 06 '17 at 19:20
  • But if that's the case, you have to wait for the page to fully load external resources (as the OP said he wanted to do without) – Kevin Jantzer Feb 06 '17 at 19:27
  • No, you only have to wait for javascript to load. No need to get all those multi-megabyte icon fonts css abominations and render them. – Vaviloff Feb 06 '17 at 19:30

2 Answers

I believe you're waiting for the page.open request to return successfully and then calling the evaluate() method, which is going to take time. Maybe you can try using evaluateAsync().

evaluateAsync(): Evaluates the given function in the context of the web page, without blocking the current execution. The function returns immediately and there is no return value. This is useful to run some script asynchronously.

http://phantomjs.org/api/webpage/method/evaluate-async.html

Donald Powell

I'm not sure which automation script you're using, so I'll point to a vanilla PhantomJS solution.

onResourceRequested lets you abort a request for a resource or redirect it elsewhere.

From the official example «Load url without css»:

page.onResourceRequested = function(requestData, request) {
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || requestData.headers['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};
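
The same idea can be extended from CSS to other heavy static assets. The following is only a minimal sketch for a vanilla PhantomJS script: the names BLOCKED_EXTENSIONS, shouldAbort and blockHeavyResources are hypothetical, and the extension list is an assumption to adjust for your pages. Only onResourceRequested and request.abort() come from the PhantomJS API. Scripts are deliberately left alone, since the HTML may be shaped by JavaScript.

```javascript
// Hypothetical list of static asset extensions to skip (an assumption, tune as needed).
// No "g" flag, so test() keeps no state between calls.
var BLOCKED_EXTENSIONS = /\.(css|png|jpe?g|gif|svg|ico|woff2?|ttf|eot)(\?.*)?$/i;

// Hypothetical helper: true when the URL points at an asset we do not need.
function shouldAbort(url) {
    return BLOCKED_EXTENSIONS.test(url);
}

// Hypothetical helper: installs the handler on a PhantomJS page object.
function blockHeavyResources(page) {
    page.onResourceRequested = function (requestData, networkRequest) {
        if (shouldAbort(requestData.url)) {
            // Cancel the request before it goes out; scripts and the document still load.
            networkRequest.abort();
        }
    };
}
```

Call blockHeavyResources(page) before page.open(url, ...). If you drive PhantomJS through a Node bridge, the handler is probably serialized and run inside PhantomJS, so it may need to be written as one self-contained function there.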
Vaviloff
  • The issue is still there. I send the request through my own proxy server, so I can observe the URLs being loaded. For example, when I request '2ip.ru' with PhantomJS, my proxy handles requests to that domain twice (http://prntscr.com/e5ktsw); the other requests (duration >10s) are unnecessary for me – Mike Feb 07 '17 at 11:10
  • Not sure I've got it - are double requests an issue with proxy or PhantomJS ? In any case, you can abort requests on any basis (path, scheme, domain, keyword) - may regex help you :) proxy looks nice, what's that? – Vaviloff Feb 07 '17 at 11:28
  • I think that when I execute page.open(), all external scripts start loading (the log screenshot is above), and onResourceRequested only starts firing after that (delay >10s). If I put this in my PhantomJS script: page.onResourceRequested = (requestData, request) => { console.log('Request'); }; I receive the data, and onResourceRequested only logs 'Request' after the delay (which I want to avoid). The proxy is implemented as a Node.js server, here it is - http://pastebin.com/eDSuxmbS – Mike Feb 07 '17 at 12:57