
I am new to PhantomJS and trying to capture the homepage of Trade Me. Here's my code so far:

var page = require('webpage').create();

page.open('http://trademe.co.nz', function () {

  // Checks for bottom div and scrolls down from time to time
  window.setInterval(function() {
      // Checks if there is a div with class=".has-more-items" 
      // (not sure if this is the best way of doing it)
      // var count = page.content.match(/class=".site-footer"/g);
      var footer_visible = page.evaluate(function() {
        return $('.site-footer').is(':visible');
      });

      if(!footer_visible) { // Didn't find
        console.log('Scrolling');
        page.evaluate(function() {
          // Scrolls to the bottom of page
          window.document.body.scrollTop = document.body.scrollHeight;
        });
      }
      else { // Found
        console.log('Found');
        // Do what you want
        window.setTimeout( function() {
            console.log('Capturing');
            page.render('phantom-capture.png', {format: 'png'});
            phantom.exit();
        }, 10000);
      }
  }, 1000); // Number of milliseconds to wait between scrolls

});

There are several things that baffle me:

  1. The word Scrolling never gets printed.
  2. It eventually gets to Found, and the word is printed 10 times. I assume that's because it is contained within the setInterval block with a 1 second interval, and there's a 10 second wait caused by the setTimeout?
  3. The page is finally rendered to the PNG file, but the asynchronously loaded panels are still empty and show the Loading... message.

I'm new to all this and my knowledge of Javascript is very rusty.

codedog

2 Answers


You are running into a common problem: how to tell when a web page has fully loaded. This is actually quite hard! I wrote a blog post a long while back about this very problem: https://sorcery.smugmug.com/2013/12/17/using-phantomjs-at-scale/ (see problem #1). Here is my feedback on your code and problem:

First, you don't need to scroll to know if the footer is loaded. jQuery's :visible selector returns true if the element takes up space in the document, not if it is within the viewport: https://api.jquery.com/visible-selector/ . This is also why 'Scrolling' is never printed: the footer already counts as visible on the first check. I'd also avoid relying on viewport visibility in PhantomJS in general, since it runs headless.
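To make the point about :visible concrete, here is a rough, jQuery-free sketch of the same kind of check. The helper name takesUpSpace is made up for illustration. The test it applies (a non-zero offset size, or at least one client rect) is what jQuery 3 appears to use for :visible; older jQuery versions differ slightly, so treat this as an approximation. Note that nothing in it looks at the scroll position or the viewport.

```javascript
// Hypothetical helper, not part of jQuery or PhantomJS.
// Returns true if the element occupies layout space, whether or not
// it is currently scrolled into view.
function takesUpSpace(el) {
    if (!el) { return false; } // element not found in the document
    return !!(el.offsetWidth || el.offsetHeight || el.getClientRects().length);
}
```

Because page.evaluate runs its function inside the page's sandbox, a helper like this would have to be defined inside the evaluated function and called there, for example on document.querySelector('.site-footer').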

Second, the page.open() callback will fire when the page has 'loaded' according to PhantomJS. This mostly means when it has fully loaded the HTML and all its included assets. However, this does not mean that asynchronously loaded content has loaded.

Third, I believe you see the output 'Found' ten times because you are using window.setInterval to check for the footer and using window.setTimeout to do the rendering. What is happening is this:

  1. PhantomJS starts loading the page and calls your callback passed to page.open() once loaded.
  2. The footer is visible on load, so footer_visible is true
  3. The 'found' block runs for the first time. This sets up a function to run in 10 seconds in the future that renders the page, then exits. BUT because it's using window.setTimeout, your script continues.
  4. The script continues, and since your outer function is set up to run every second, it runs again! It checks for the footer, finds it and sets up a function to run in 10s to render the page. It continues to do this for 10 seconds.
  5. After 10 seconds, the first function that was set up to render the page does so and then tells PhantomJS to exit. This kills all the other functions that were set up to render the page in 10 seconds.
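The sequence above comes down to a polling timer that keeps firing after its condition has been met. A minimal sketch of one way to avoid that, assuming a made-up helper called pollOnce (plain JavaScript timers only, no PhantomJS API involved): it clears the interval the first time the check passes, so the follow-up work is scheduled exactly once.

```javascript
// Hypothetical helper: poll `check` every `intervalMs` milliseconds and call
// `onReady` exactly once, the first time `check` returns a truthy value.
function pollOnce(check, onReady, intervalMs) {
    var timer = setInterval(function () {
        if (check()) {
            clearInterval(timer); // stop polling so onReady cannot be queued again
            onReady();
        }
    }, intervalMs);
    return timer; // returned so the caller can cancel the polling if needed
}
```

In the script from the question, check would wrap the page.evaluate call that looks for the footer, and onReady would hold the setTimeout that renders the page and calls phantom.exit().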

If you really want to render the page when the footer is in the document, here is your fixed code:

var page = require('webpage').create();

page.open('http://trademe.co.nz', function () {

    window.setInterval(function() {
        var footer_visible = page.evaluate(function() {
            return $('.site-footer').is(':visible');
        });

        if(footer_visible) {
            page.render('phantom-capture.png', {format: 'png'});
            phantom.exit();
        }
    }, 1000);
});

However, this will not wait until all content is loaded; that is a much harder problem. Please read my blog post linked above for tips on how to do this. If you don't want to read it, here's a TL;DR:

Through a lot of manual testing and QA we eventually came to a solution where we track each and every HTTP request PhantomJS makes and watch every step of the transaction (start, progress, end, failed). Only once every single request has completed (or failed, etc.) do we start ‘waiting’. We give the page 500ms to either start making more requests or finish adding content to the DOM. After that timeout we assume the page is done.
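A minimal sketch of that idea follows. It is not the code from the blog post: createRequestTracker and its methods are hypothetical names, and the 500ms quiet period is simply the figure quoted above. The wiring uses PhantomJS's page.onResourceRequested, page.onResourceReceived, page.onResourceError and page.onResourceTimeout callbacks, and is guarded so that it only runs under PhantomJS.

```javascript
// Hypothetical helper (not a PhantomJS API): counts in-flight requests and
// calls onIdle once nothing has been pending for quietMs milliseconds.
function createRequestTracker(quietMs, onIdle) {
    var pending = {};   // request id -> true while the request is in flight
    var count = 0;
    var timer = null;

    function disarm() {
        if (timer !== null) { clearTimeout(timer); timer = null; }
    }

    return {
        started: function (id) {
            if (!pending[id]) { pending[id] = true; count += 1; }
            disarm(); // new activity: cancel any running quiet period
        },
        finished: function (id) {
            if (pending[id]) { delete pending[id]; count -= 1; }
            disarm();
            if (count === 0) { timer = setTimeout(onIdle, quietMs); }
        },
        pendingCount: function () { return count; }
    };
}

// Wiring for PhantomJS; skipped when the file runs in another JS runtime.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var tracker = createRequestTracker(500, function () {
        page.render('phantom-capture.png');
        phantom.exit();
    });

    page.onResourceRequested = function (requestData) {
        tracker.started(requestData.id);
    };
    page.onResourceReceived = function (response) {
        // Fires for stage 'start' and 'end'; only 'end' completes a request.
        if (response.stage === 'end') { tracker.finished(response.id); }
    };
    page.onResourceError = function (resourceError) {
        tracker.finished(resourceError.id);
    };
    page.onResourceTimeout = function (request) {
        tracker.finished(request.id);
    };

    page.open('http://trademe.co.nz', function (status) {
        if (status !== 'success') {
            console.log('Failed to load page');
            phantom.exit(1);
        }
    });
}
```

Keeping the counting logic separate from the PhantomJS callbacks makes it possible to exercise it with plain timers. A real script would probably also want an overall deadline, since a page that polls its server continuously never goes quiet.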

Ryan Doherty

[This question](http://stackoverflow.com/q/11340038/1816580) has got you covered. I find that these answers have the best ideas: [1](http://stackoverflow.com/a/21401636/1816580), [2](http://stackoverflow.com/a/38468106/1816580), [3](http://stackoverflow.com/a/38132403/1816580). – Artjom B. Oct 13 '16 at 18:51

Ryan Doherty has provided a great explanation of why console.log('Scrolling'); never gets called, and you figured out yourself why Found is printed 10 times!

And I'd like to talk about how to deal with those ajaxified pages. Generally when you work with such sites you can figure out a criterion by which to judge if the page has loaded, or at least parts of it that you need (though sometimes, as Ryan rightfully notes, it can be very hard, especially if there are a lot of external resources and/or iframes on a page).

In this case I suppose we can decide that the page has loaded when there are no "Loading" labels left. So we turn off JavaScript and inspect those labels. It turns out they are <div class="carousel-loading-card">. That means we only have to wait until they are gone. But to trigger their loading we must simulate page scrolling. In PhantomJS you can do that "natively" by changing the page.scrollPosition property.

var page = require('webpage').create();

// Let's not confuse the target site with the default user agent
// and the native viewport dimensions of 400x300
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0';
page.viewportSize = { width: 1280, height: 1024 };

var totalHeight, scroll = 0;

page.open('http://trademe.co.nz', function(){

    totalHeight = page.evaluate(function(){
        return $(document).height();
    });

    wait();

});

function wait()
{
    var loading = page.evaluate(function(){
        return $(".carousel-loading-card").length;
    });

    if(loading > 0) {

        if(scroll <= totalHeight)
        {
            scroll += 200;

            page.scrollPosition = {
                top: scroll,
                left: 0
            };

            page.render('trademe-' + (new Date()).getTime() + '.jpg');
        }

        console.log(loading + " panels left. Scroll: " + scroll + "px");
        setTimeout(wait, 3000);        

    } else {
        // Restore defaults to make a full page screenshot at the end
        page.scrollPosition = { top: 0, left: 0 };        
        page.render('trademe-ready.png');
        phantom.exit();
    }

}
Vaviloff