142

I'm using PhantomJS v1.4.1 to load some web pages. I don't have access to their server side; I'm just given links pointing to them. I'm using an obsolete version of PhantomJS because I need to support Adobe Flash on those pages.

The problem is that many websites load their secondary content asynchronously, so PhantomJS's onLoadFinished callback (the analogue of onLoad in HTML) fires too early, before everything has loaded. Can anyone suggest how I can wait for a web page to load fully so that I can, for example, take a screenshot with all the dynamic content, such as ads?

Cybermaxs
nilfalse

14 Answers

80

Another approach is to just ask PhantomJS to wait for a bit after the page has loaded before doing the render, as per the regular rasterize.js example, but with a longer timeout to allow the JavaScript to finish loading additional resources:

page.open(address, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the address!');
        phantom.exit();
    } else {
        window.setTimeout(function () {
            page.render(output);
            phantom.exit();
        }, 1000); // Change timeout as required to allow sufficient time 
    }
});
rhunwicks
  • 1
    Yes, currently I'm sticking with this approach. – nilfalse Feb 11 '13 at 06:52
  • 107
    It's a horrible solution, sorry (it's PhantomJS's fault!). If you wait a full second, but it takes 20ms to load, it's a complete waste of time (think batch jobs), or if it takes longer than a second, it will still fail. Such inefficiency and unreliability is unbearable for professional work. – CodeManX Jul 15 '15 at 09:35
  • 12
    The real problem here is that you never know when JavaScript will finish loading the page, and the browser doesn't know either. Imagine a site whose JavaScript loads something from the server in an infinite loop. From the browser's point of view, JavaScript execution never ends, so at what moment do you want PhantomJS to tell you that it has finished? This problem is unsolvable in the generic case, except by waiting for a timeout and hoping for the best. – Maxim Galushka Sep 07 '15 at 11:10
  • 1
    I agree that this solution is horrible. It's a race condition bug: the arbitrary 1000ms is no guarantee that rendering will have finished by then. – Tim Oct 28 '15 at 08:21
  • Compared to the other [*answer*](http://stackoverflow.com/a/14748934/4058484), this one works for me. I'd say `setTimeout` is safer than `waitfor` in case of an unfinished loop during page rendering. – eQ19 Feb 11 '16 at 13:59
  • @alex88 The OP should NOT accept an answer that is not a SOLUTION. This workaround is common sense and can produce a result but at a substantial cost (time). So, it is definitely not a solution. – Brandon Elliott May 20 '16 at 18:36
  • 6
    Is this still the best solution as of 2016? It seems like we should be able to do better than this. – Adam Thompson Nov 07 '16 at 22:28
  • 6
    If you are in control of the code you're trying to read, you can call the phantom js call back explicitly: http://phantomjs.org/api/webpage/handler/on-callback.html – Andy Smith Dec 04 '16 at 15:17
53

I would rather periodically check document.readyState (https://developer.mozilla.org/en-US/docs/Web/API/document.readyState). Although this approach is a bit clunky, you can be sure that inside the onPageReady function you are working with a fully loaded document.

var page = require("webpage").create(),
    url = "http://example.com/index.html";

function onPageReady() {
    var htmlContent = page.evaluate(function () {
        return document.documentElement.outerHTML;
    });

    console.log(htmlContent);

    phantom.exit();
}

page.open(url, function (status) {
    function checkReadyState() {
        setTimeout(function () {
            var readyState = page.evaluate(function () {
                return document.readyState;
            });

            if ("complete" === readyState) {
                onPageReady();
            } else {
                checkReadyState();
            }
        });
    }

    checkReadyState();
});

Additional explanation:

Using nested setTimeout calls instead of setInterval prevents checkReadyState invocations from overlapping and racing when one execution takes longer than expected. Even without a delay argument, browsers typically clamp nested setTimeout calls to a minimum delay of about 4ms (https://stackoverflow.com/a/3580085/1011156), so this active polling will not drastically affect program performance.

document.readyState === "complete" means that document is completely loaded with all resources (https://html.spec.whatwg.org/multipage/dom.html#current-document-readiness).
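The nested-setTimeout pattern described above can be sketched as a small reusable helper that also gives up after a deadline, so a page that never reaches the expected state cannot stall the script forever. This is only a sketch: `pollUntil`, its option names, and the example URL are hypothetical and not part of PhantomJS.

```javascript
// Hypothetical helper (not a PhantomJS API): polls `check` using nested timers
// until it returns true or `timeoutMs` has elapsed, then calls onDone(true/false).
// `schedule` and `now` are injectable so the helper can be exercised without real timers.
function pollUntil(check, onDone, options) {
    options = options || {};
    var intervalMs = options.intervalMs || 50;
    var timeoutMs = options.timeoutMs || 10000;
    var schedule = options.schedule || function (fn, ms) { setTimeout(fn, ms); };
    var now = options.now || function () { return new Date().getTime(); };
    var start = now();

    function tick() {
        if (check()) {
            onDone(true);                // condition met
        } else if (now() - start >= timeoutMs) {
            onDone(false);               // deadline passed, give up
        } else {
            schedule(tick, intervalMs);  // next check is queued only after this one has finished
        }
    }

    tick();
}

// Wiring that only runs inside PhantomJS; the URL is a placeholder.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://example.com/index.html', function () {
        pollUntil(function () {
            return page.evaluate(function () {
                return document.readyState === 'complete';
            });
        }, function (ready) {
            console.log(ready ? 'document complete' : 'timed out');
            phantom.exit(ready ? 0 : 1);
        }, { timeoutMs: 5000 });
    });
}
```

Because each check schedules the next one only after it has returned, two checks can never run concurrently, which is the property the explanation above relies on.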

EDIT 2022: I wrote this answer 8 years ago and have not used PhantomJS since. It very probably won't work in some cases now. I also now think it is not possible to create a one-size-fits-all solution that guarantees the page is fully loaded, because some pages load additional resources after the document is ready. For example, a site may contain JS code that waits for the document to be ready and then loads additional assets (after the document state changes to ready). In this case onPageReady will trigger, and after that the page will start loading more resources again.

I still think the above snippet is a good starting point and may work in most cases, but it may also be necessary to create specific solutions for specific websites.

Mateusz Charytoniuk
  • 4
    the comment on setTimeout vs setInterval is great. – Gal Bracha Jul 02 '15 at 15:49
  • 1
    `readyState` will only trigger once the DOM has been fully loaded, however any ` – CodingIntrigue Nov 29 '15 at 08:36
  • 1
    @rgraham It's not ideal but I think we can only do so much with these renderers. There are going to be edge cases where you just won't know if something is loaded fully. Think about a page where content is delayed, on purpose, by a minute or two. It is unreasonable to expect the render process to sit around and wait an indefinite amount of time. The same goes for content loaded from external sources that may be slow. – Brandon Elliott May 20 '16 at 18:41
  • 3
    This doesn't consider any JavaScript loading after DOM fully loads, such as with Backbone/Ember/Angular. – Adam Thompson Nov 07 '16 at 22:30
  • 1
    Didn't work at all for me. readyState complete may well have fired, but the page was blank at this point. – Steve Staple Apr 11 '17 at 09:37
  • doesn't work at all... and even after making it working readyState is complete before it's ready. – Flash Thunder Jan 08 '18 at 22:00
  • Doesn't work as said before in my case it renders incomplete page. – lisandro Oct 27 '21 at 17:10
21

You could try a combination of the waitfor and rasterize examples:

/**
 * See https://github.com/ariya/phantomjs/blob/master/examples/waitfor.js
 * 
 * Wait until the test condition is true or a timeout occurs. Useful for waiting
 * on a server response or for a ui change (fadeIn, etc.) to occur.
 *
 * @param testFx javascript condition that evaluates to a boolean,
 * it can be passed in as a string (e.g.: "1 == 1" or "$('#bar').is(':visible')" or
 * as a callback function.
 * @param onReady what to do when testFx condition is fulfilled,
 * it can be passed in as a string (e.g.: "1 == 1" or "$('#bar').is(':visible')" or
 * as a callback function.
 * @param timeOutMillis the max amount of time to wait. If not specified, 3 sec is used.
 */
function waitFor(testFx, onReady, timeOutMillis) {
    var maxtimeOutMillis = timeOutMillis ? timeOutMillis : 3000, //< Default Max Timout is 3s
        start = new Date().getTime(),
        condition = (typeof(testFx) === "string" ? eval(testFx) : testFx()), //< defensive code
        interval = setInterval(function() {
            if ( (new Date().getTime() - start < maxtimeOutMillis) && !condition ) {
                // If not time-out yet and condition not yet fulfilled
                condition = (typeof(testFx) === "string" ? eval(testFx) : testFx()); //< defensive code
            } else {
                if(!condition) {
                    // If condition still not fulfilled (timeout but condition is 'false')
                    console.log("'waitFor()' timeout");
                    phantom.exit(1);
                } else {
                    // Condition fulfilled (timeout and/or condition is 'true')
                    console.log("'waitFor()' finished in " + (new Date().getTime() - start) + "ms.");
                    typeof(onReady) === "string" ? eval(onReady) : onReady(); //< Do what it's supposed to do once the condition is fulfilled
                    clearInterval(interval); //< Stop this interval
                }
            }
        }, 250); //< repeat check every 250ms
};

var page = require('webpage').create(), system = require('system'), address, output, size;

if (system.args.length < 3 || system.args.length > 5) {
    console.log('Usage: rasterize.js URL filename [paperwidth*paperheight|paperformat] [zoom]');
    console.log('  paper (pdf output) examples: "5in*7.5in", "10cm*20cm", "A4", "Letter"');
    phantom.exit(1);
} else {
    address = system.args[1];
    output = system.args[2];
    if (system.args.length > 3 && system.args[2].substr(-4) === ".pdf") {
        size = system.args[3].split('*');
        page.paperSize = size.length === 2 ? {
            width : size[0],
            height : size[1],
            margin : '0px'
        } : {
            format : system.args[3],
            orientation : 'portrait',
            margin : {
                left : "5mm",
                top : "8mm",
                right : "5mm",
                bottom : "9mm"
            }
        };
    }
    if (system.args.length > 4) {
        page.zoomFactor = system.args[4];
    }
    var resources = [];
    page.onResourceRequested = function(request) {
        resources[request.id] = request.stage;
    };
    page.onResourceReceived = function(response) {
        resources[response.id] = response.stage;
    };
    page.open(address, function(status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
            phantom.exit();
        } else {
            waitFor(function() {
                // Check that every tracked resource has reached the 'end' stage
                for ( var i = 1; i < resources.length; ++i) {
                    if (resources[i] != 'end') {
                        return false;
                    }
                }
                return true;
            }, function() {
               page.render(output);
               phantom.exit();
            }, 10000);
        }
    });
}
rhunwicks
  • 3
    It seems this wouldn't work with web pages that use any server push technology, as resources will still be in use after onLoad has occurred. – nilfalse Feb 11 '13 at 06:57
  • Do any drivers, eg. [poltergeist](https://github.com/jonleighton/poltergeist), have a feature like this? – Jared Beck Dec 24 '13 at 19:57
  • Is it possible to use waitFor to poll the whole html text and search for a defined keyword? I tried to implement this but it seems that the polling does not refresh to the latest downloaded html source. – fpdragon Feb 24 '16 at 09:19
16

Here is a solution that waits for all resource requests to complete. Once complete it will log the page content to the console and generate a screenshot of the rendered page.

Although this solution can serve as a good starting point, I have observed it fail so it's definitely not a complete solution!

I didn't have much luck using document.readyState.

I was influenced by the waitfor.js example found on the phantomjs examples page.

var system = require('system');
var webPage = require('webpage');

var page = webPage.create();
var url = system.args[1];

page.viewportSize = {
  width: 1280,
  height: 720
};

var requestsArray = [];

page.onResourceRequested = function(requestData, networkRequest) {
  requestsArray.push(requestData.id);
};

page.onResourceReceived = function(response) {
  var index = requestsArray.indexOf(response.id);
  if (index > -1 && response.stage === 'end') {
    requestsArray.splice(index, 1);
  }
};

page.open(url, function(status) {

  var interval = setInterval(function () {

    if (requestsArray.length === 0) {

      clearInterval(interval);
      var content = page.content;
      console.log(content);
      page.render('yourLoadedPage.png');
      phantom.exit();
    }
  }, 500);
});
Yvan
Dave
  • Gave a thumbs-up, but used setTimeout with 10, instead of interval – GDmac Feb 28 '17 at 15:07
  • 1
    You should check that response.stage is equal to 'end' before removing it from the requests array, otherwise it might be removed prematurely. – Reimund Mar 22 '17 at 09:19
  • 1
    This does not work if your webpage loads the DOM dynamically – Buddy Jun 15 '17 at 19:58
14

Maybe you can use the onResourceRequested and onResourceReceived callbacks to detect asynchronous loading. Here's an example of using those callbacks from their documentation:

var page = require('webpage').create();
page.onResourceRequested = function (request) {
    console.log('Request ' + JSON.stringify(request, undefined, 4));
};
page.onResourceReceived = function (response) {
    console.log('Receive ' + JSON.stringify(response, undefined, 4));
};
page.open(url);

Also, you can look at examples/netsniff.js for a working example.
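Going one step beyond logging, those callbacks can feed a counter of requests that have started but not yet finished. The following is a minimal sketch under stated assumptions: `createRequestTracker` and its method names are hypothetical, the URL is a placeholder, and `onResourceError` (also used by another answer in this thread) may not be available in very old PhantomJS versions, in which case that assignment simply has no effect.

```javascript
// Hypothetical helper (not a PhantomJS API): tracks the ids of requests
// that have been issued but have not yet completed or failed.
function createRequestTracker() {
    var pending = {};
    var count = 0;

    function finish(id) {
        // Idempotent: a request that both errors and "ends" is only counted once
        if (pending.hasOwnProperty(id)) {
            delete pending[id];
            count--;
        }
    }

    return {
        onRequested: function (request) {
            if (!pending.hasOwnProperty(request.id)) {
                pending[request.id] = true;
                count++;
            }
        },
        onReceived: function (response) {
            // Responses are reported in stages; only 'end' means the resource is done
            if (response.stage === 'end') {
                finish(response.id);
            }
        },
        onError: function (resourceError) {
            finish(resourceError.id);
        },
        pendingCount: function () {
            return count;
        }
    };
}

// Wiring that only runs inside PhantomJS; the URL is a placeholder.
if (typeof phantom !== 'undefined') {
    var tracker = createRequestTracker();
    var page = require('webpage').create();
    page.onResourceRequested = tracker.onRequested;
    page.onResourceReceived = tracker.onReceived;
    page.onResourceError = tracker.onError;
    page.open('http://example.com/', function (status) {
        console.log('Load status: ' + status + ', still pending: ' + tracker.pendingCount());
        phantom.exit();
    });
}
```

A pending count of zero only says that nothing is in flight at that instant; a script on the page may still start new requests later, so this is best combined with a short quiet period or an overall timeout.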

Supr
  • But in this case I can't use one instance of PhantomJS to load more than one page at a time, right? – nilfalse Jul 10 '12 at 05:01
  • Does onResourceRequested apply to AJAX/Cross Domain requests? Or does it apply only to like css, images.. etc? – CMCDragonkai Sep 24 '13 at 08:49
  • @CMCDragonkai I have never used it myself, but based on [this](https://github.com/ariya/phantomjs/wiki/Network-Monitoring) it seems like it includes all requests. Quote: `All the resource requests and responses can be sniffed using onResourceRequested and onResourceReceived` – Supr Sep 24 '13 at 09:13
  • I have used this method with large scale PhantomJS rendering and it works quite well. You do need a lot of smarts to track requests and watch if they fail or timeout. More info: https://sorcery.smugmug.com/2013/12/17/using-phantomjs-at-scale/ – Ryan Doherty Oct 13 '16 at 16:12
13

In my program, I use some logic to decide whether the page has finished loading: I watch its network requests, and if there has been no new request in the past 200ms, I treat the page as loaded.

Use this after onLoadFinished() has fired.

function onLoadComplete(page, callback){
    var waiting = [];  // request id
    var interval = 200;  //ms time waiting new request
    var timer = setTimeout( timeout, interval);
    var max_retry = 3;  //
    var counter_retry = 0;

    function timeout(){
        if(waiting.length && counter_retry < max_retry){
            timer = setTimeout( timeout, interval);
            counter_retry++;
            return;
        }else{
            try{
                callback(null, page);
            }catch(e){}
        }
    }

    //for debug, log time cost
    var tlogger = {};

    bindEvent(page, 'request', function(req){
        waiting.push(req.id);
    });

    bindEvent(page, 'receive', function (res) {
        var cT = res.contentType;
        if(!cT){
            console.log('[contentType] ', cT, ' [url] ', res.url);
        }
        if(!cT) return remove(res.id);
        if(cT.indexOf('application') * cT.indexOf('text') != 0) return remove(res.id);

        if (res.stage === 'start') {
            console.log('!!received start: ', res.id);
            //console.log( JSON.stringify(res) );
            tlogger[res.id] = new Date();
        }else if (res.stage === 'end') {
            console.log('!!received end: ', res.id, (new Date() - tlogger[res.id]) );
            //console.log( JSON.stringify(res) );
            remove(res.id);

            clearTimeout(timer);
            timer = setTimeout(timeout, interval);
        }

    });

    bindEvent(page, 'error', function(err){
        remove(err.id);
        if(waiting.length === 0){
            counter_retry = 0;
        }
    });

    function remove(id){
        var i = waiting.indexOf( id );
        if(i < 0){
            return;
        }else{
            waiting.splice(i,1);
        }
    }

    function bindEvent(page, evt, cb){
        switch(evt){
            case 'request':
                page.onResourceRequested = cb;
                break;
            case 'receive':
                page.onResourceReceived = cb;
                break;
            case 'error':
                page.onResourceError = cb;
                break;
            case 'timeout':
                page.onResourceTimeout = cb;
                break;
        }
    }
}
Artjom B.
deemstone
11

I found this approach useful in some cases:

page.onConsoleMessage = function(msg) {
  // do something e.g. page.render
};

Then, if you own the page, put a script like this inside it:

<script>
  window.onload = function(){
    console.log('page loaded');
  }
</script>
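Since a page may log many unrelated console messages, it can help to react only to the specific marker string and only once. A minimal sketch, assuming the page logs exactly 'page loaded' as in the script above; `createMarkerHandler`, the output file name, and the URL are hypothetical.

```javascript
// Hypothetical helper (not a PhantomJS API): returns a console-message handler
// that calls `onReady` once, the first time the page logs exactly `marker`.
function createMarkerHandler(marker, onReady) {
    var fired = false;
    return function (msg) {
        if (!fired && msg === marker) {
            fired = true;
            onReady();
        }
    };
}

// Wiring that only runs inside PhantomJS; file name and URL are placeholders.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    // onConsoleMessage is a property that is assigned, not a function that is called
    page.onConsoleMessage = createMarkerHandler('page loaded', function () {
        page.render('loaded.png');
        phantom.exit();
    });
    page.open('http://example.com/');
}
```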
Brankodd
  • This looks like a really nice work-around, however, I could not get any log message from my HTML/JavaScript page to pass through phantomJS... the onConsoleMessage event never triggered while I could see the messages perfectly on the Browser console, and I have no clue why. – Dirk Sep 01 '15 at 17:29
  • 1
    I needed page.onConsoleMessage = function(msg){}; – Andy Balaam Feb 19 '16 at 17:04
5

I found this solution useful in a NodeJS app. I use it just in desperate cases because it launches a timeout in order to wait for the full page load.

The second argument is the callback function which is going to be called once the response is ready.

var phantom = require('phantom');

var fullLoad = function(anUrl, callbackDone) {
    phantom.create(function (ph) {
        ph.createPage(function (page) {
            page.open(anUrl, function (status) {
                if (status !== 'success') {
                    console.error("phantom: error opening " + anUrl, status);
                    ph.exit();
                } else {
                    // timeOut
                    global.setTimeout(function () {
                        page.evaluate(function () {
                            return document.documentElement.innerHTML;
                        }, function (result) {
                            ph.exit(); // EXTREMELY IMPORTANT
                            callbackDone(result); // callback
                        });
                    }, 5000);
                }
            });
        });
    });
}

var callback = function(htmlBody) {
    // do smth with the htmlBody
}

fullLoad('your/url/', callback);
Manu Artero
3

This is an implementation of Supr's answer. Also it uses setTimeout instead of setInterval as Mateusz Charytoniuk suggested.

Phantomjs will exit in 1000ms when there isn't any request or response.

// load the module
var webpage = require('webpage');
// get timestamp
function getTimestamp(){
    // or use Date.now()
    return new Date().getTime();
}

var lastTimestamp = getTimestamp();

var page = webpage.create();
page.onResourceRequested = function(request) {
    // update the timestamp when there is a request
    lastTimestamp = getTimestamp();
};
page.onResourceReceived = function(response) {
    // update the timestamp when there is a response
    lastTimestamp = getTimestamp();
};

page.open(html, function(status) {
    if (status !== 'success') {
        // exit if it fails to load the page
        phantom.exit(1);
    }
    else{
        // do something here
    }
});

function checkReadyState() {
    setTimeout(function () {
        var currentTimestamp = getTimestamp();
        if(currentTimestamp-lastTimestamp>1000){
            // exit if there isn't request or response in 1000ms
            phantom.exit();
        }
        else{
            checkReadyState();
        }
    }, 100);
}

checkReadyState();
Dayong
3

This is the code I use:

var system = require('system');
var page = require('webpage').create();

page.open('http://....', function(){
      console.log(page.content);
      var k = 0;

      var loop = setInterval(function(){
          var qrcode = page.evaluate(function(s) {
             // Guard against the element not being in the DOM yet
             var el = document.querySelector(s);
             return el ? el.src : null;
          }, '.qrcode img');

          k++;
          if (qrcode){
             console.log('dataURI:', qrcode);
             clearInterval(loop);
             phantom.exit();
          }

          if (k === 50) phantom.exit(); // 10 sec timeout
      }, 200);
  });

Basically, this relies on the fact that you know the page is fully loaded once a given element appears in the DOM, so the script waits until that happens.

Rocco Musolino
3

I use a personal blend of the phantomjs waitfor.js example.

This is my main.js file:

'use strict';

var wasSuccessful = phantom.injectJs('./lib/waitFor.js');
var page = require('webpage').create();

page.open('http://foo.com', function(status) {
  if (status === 'success') {
    page.includeJs('https://cdnjs.cloudflare.com/ajax/libs/jquery/3.1.1/jquery.min.js', function() {
      waitFor(function() {
        return page.evaluate(function() {
          if ('complete' === document.readyState) {
            return true;
          }

          return false;
        });
      }, function() {
        var fooText = page.evaluate(function() {
          return $('#foo').text();
        });

        phantom.exit();
      });
    });
  } else {
    console.log('error');
    phantom.exit(1);
  }
});

And the lib/waitFor.js file (which is just a copy and paste of the waitFor() function from the phantomjs waitfor.js example):

function waitFor(testFx, onReady, timeOutMillis) {
    var maxtimeOutMillis = timeOutMillis ? timeOutMillis : 3000, //< Default Max Timout is 3s
        start = new Date().getTime(),
        condition = false,
        interval = setInterval(function() {
            if ( (new Date().getTime() - start < maxtimeOutMillis) && !condition ) {
                // If not time-out yet and condition not yet fulfilled
                condition = (typeof(testFx) === "string" ? eval(testFx) : testFx()); //< defensive code
            } else {
                if(!condition) {
                    // If condition still not fulfilled (timeout but condition is 'false')
                    console.log("'waitFor()' timeout");
                    phantom.exit(1);
                } else {
                    // Condition fulfilled (timeout and/or condition is 'true')
                    // console.log("'waitFor()' finished in " + (new Date().getTime() - start) + "ms.");
                    typeof(onReady) === "string" ? eval(onReady) : onReady(); //< Do what it's supposed to do once the condition is fulfilled
                    clearInterval(interval); //< Stop this interval
                }
            }
        }, 250); //< repeat check every 250ms
}

This method is not asynchronous, but at least I am assured that all the resources were loaded before I try to use them.

Daishi
2

This is an old question, but I was looking for a full-page-load solution for SpookyJS (which uses CasperJS and PhantomJS) and didn't find one, so I wrote my own script using the same approach as the user deemstone. The idea is that if the page has neither started nor received any request for a given amount of time, loading is treated as finished.

In the casper.js file (if you installed it globally, the path would be something like /usr/local/lib/node_modules/casperjs/modules/casper.js), add the following lines:

At the top of the file with all the global vars:

var waitResponseInterval = 500
var reqResInterval = null
var reqResFinished = false
var resetTimeout = function() {}

Then inside function "createPage(casper)" just after "var page = require('webpage').create();" add the following code:

 resetTimeout = function() {
     if(reqResInterval)
         clearTimeout(reqResInterval)

     reqResInterval = setTimeout(function(){
         reqResFinished = true
         page.onLoadFinished("success")
     },waitResponseInterval)
 }
 resetTimeout()

Then inside "page.onResourceReceived = function onResourceReceived(resource) {" on the first line add:

 resetTimeout()

Do the same for "page.onResourceRequested = function onResourceRequested(requestData, request) {"

Finally, on "page.onLoadFinished = function onLoadFinished(status) {" on the first line add:

 if(!reqResFinished)
 {
      return
 }
 reqResFinished = false

And that's it, hope this one helps someone in trouble like I was. This solution is for casperjs but works directly for Spooky.

Good luck !

fdnieves
0

This is my solution; it worked for me.

page.onConsoleMessage = function(msg, lineNum, sourceId) {

    if(msg=='hey lets take screenshot')
    {
        var poll = window.setInterval(function(){
            try
            {
                 var sta= page.evaluateJavaScript("function(){ return jQuery.active;}");
                 if(sta == 0)
                 {
                    // Stop polling first, so the render is scheduled only once
                    window.clearInterval(poll);
                    window.setTimeout(function(){
                        page.render('test.png');
                        phantom.exit();
                    },1000);
                 }
            }
            catch(error)
            {
                console.log(error);
                phantom.exit(1);
            }
       },1000);
    }       
};


page.open(address, function (status) {      
    if (status !== "success") {
        console.log('Unable to load url');
        phantom.exit();
    } else { 
       page.setContent(page.content.replace('</body>','<script>window.onload = function(){console.log(\'hey lets take screenshot\');}</script></body>'), address);
    }
});
Tom
0

Sending mouse-move events while the page is loading should work.

 page.sendEvent('click',200, 660);

do { phantom.page.sendEvent('mousemove'); } while (page.loading);

UPDATE

When submitting the form, nothing was returned, so the program stopped. The program did not wait for the page to load, as it took a few seconds for the redirect to begin.

Telling it to move the mouse until the URL changes to the home page gave the browser as much time as it needed to change. Then telling it to wait for the page to finish loading allowed the page to load fully before the content was grabbed.

page.evaluate(function () {
document.getElementsByClassName('btn btn-primary btn-block')[0].click();
});
do { phantom.page.sendEvent('mousemove'); } while (page.evaluate(function()
{
return document.location != "https://www.bestwaywholesale.co.uk/";
}));
do { phantom.page.sendEvent('mousemove'); } while (page.loading);
Alan P