
Open a web page and take a screenshot.

Using ONLY PhantomJS (this is a simple script; in fact it is the example script from their docs: http://phantomjs.org/screen-capture.html):

var page = require('webpage').create();
page.open('http://github.com/', function() {
  page.render('github.png');
  phantom.exit();
});

The problem is that some websites (GitHub, funnily enough) seem to detect PhantomJS and refuse to serve it, so nothing is rendered. The result is that github.png is a blank white PNG file.

Replace github.com with, say, "google.com" and you get a proper screenshot, as intended.
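One thing the documentation example does not do is look at the `status` argument that `page.open` passes to its callback, so a failed load and a successful one both end in a `page.render` call. A minimal sketch of a status check follows. `describeLoad` is a hypothetical helper name introduced here, and the `typeof phantom` guard exists only so the helper can also be loaded outside PhantomJS. Whether the blank GitHub render actually reports a failed status is not stated in this post; the check simply makes a failed load visible if there is one.

```javascript
// Hypothetical helper: turn the status string that page.open hands to its
// callback ('success' or 'fail') into a decision plus a log message.
function describeLoad(url, status) {
  var ok = status === 'success';
  return {
    ok: ok,
    message: (ok ? 'Loaded ' : 'Failed to load ') + url + ' (status: ' + status + ')'
  };
}

// Only run the PhantomJS part when the script is executed by phantomjs.
if (typeof phantom !== 'undefined') {
  var url = 'http://github.com/';
  var page = require('webpage').create();
  page.open(url, function (status) {
    var result = describeLoad(url, status);
    console.log(result.message);
    if (result.ok) {
      page.render('github.png');
    }
    // A non-zero exit code signals the failed load to the calling shell.
    phantom.exit(result.ok ? 0 : 1);
  });
}
```

Run it the same way as the original script, for example `phantomjs github.js`.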

At first I thought this was a PhantomJS issue, so I tried running it through CasperJS with:

casper.start('http://www.github.com/', function() {
    this.captureSelector('github.png', 'body');
});

casper.run();

But I get the same behavior as with PhantomJS.

So I figured, OK, this is most likely a user agent issue: GitHub sniffs out PhantomJS and decides not to show the page. So I set the user agent as shown below, but that still didn't work.

var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36';
page.open('http://github.com/', function() {
  page.render('github.png');
  phantom.exit();
});

So then I tried to parse the page, and apparently some sites (again, like GitHub) don't appear to be sending anything down the wire.

Using CasperJS I tried to print the title. For google.com I got back "Google", but for github.com I got back bupkis. Example code:

var casper = require('casper').create();

casper.start('http://github.com/', function() {
    this.echo(this.getTitle());
});

casper.run();  

The equivalent script in pure PhantomJS produces the same result.

Update:

Could this be a timing issue? Is GitHub just super slow? I doubt it, but let's test anyway...

var page = require('webpage').create();
page.open('http://github.com', function (status) {
    /* irrelevant */
    window.setTimeout(function () {
        page.render('github.png');
        phantom.exit();
    }, 3000);
});

And the result is still bupkis. So no, it's not a timing issue.

  1. How are some sites like github blocking phantomjs?
  2. How can we reliably take screenshots of ALL webpages? The solution needs to be fast and headless.
  • The most reliable would probably be a headless firefox solution (watir/webdriver?) – pguardiario Oct 23 '14 at 01:10
  • @pguardiario, thanks, I've seen your posts. Watir WebDriver has worked well for me in the past, but it is usually on the slower side. I have used it for tests and small scraping jobs... Is there an easy way to deploy Watir on Heroku or EC2 in a production app? – MrPizzaFace Oct 23 '14 at 01:27
  • I've used watir-webdriver on EC2 Ubuntu instances and it was always straightforward. – pguardiario Oct 23 '14 at 01:40
  • Yeah, if PhantomJS fails me, I will probably fall back on Watir if it's the only reliable play. – MrPizzaFace Oct 23 '14 at 01:45
  • possible duplicate of [PhantomJS failing to open HTTPS site](http://stackoverflow.com/questions/12021578/phantomjs-failing-to-open-https-site) – Artjom B. Oct 23 '14 at 06:18
  • and the same for CasperJS: [CasperJS/PhantomJS doesn't load https page](http://stackoverflow.com/questions/26415188/casperjs-phantomjs-doesnt-load-https-page) – Artjom B. Oct 23 '14 at 06:19
  • @artjom good job on the syntax fix. I was able to test it real quick, but unfortunately it still does not properly render the page image. It does however remove the NoMethodError, so kudos to you, but we're still not all the way there. I will update my answer to reflect the progress. Thanks for sharing! ;) – MrPizzaFace Oct 23 '14 at 21:35
  • So, I tried [this](https://gist.github.com/artjomb/c9cb625b10bd0f697e84) with `--ssl-protocol=tlsv1` and it produced [this image](http://i.imgur.com/ibJJXhU.png). Now I can't see anything wrong with the picture. Is there another issue that you're having? – Artjom B. Oct 23 '14 at 21:47
  • @ArtjomB. Yes that is odd. I did it with `--ssl-protocol=any` and got a different image. See my updated answer... – MrPizzaFace Oct 23 '14 at 21:49
  • The images on GitHub are webfonts. There were problems in the past with them. Are you on Linux? Do you use the current stable PhantomJS version (1.9.7)? You may want to compile PhantomJS 2 or try SlimerJS. – Artjom B. Oct 23 '14 at 21:59

1 Answer

After bouncing this around for some time I was able to narrow down the problem. Apparently PhantomJS defaults to SSLv3, which causes GitHub to refuse the connection because of a failed SSL handshake.

phantomjs --debug=true github.js

Shows output of:

. . .
2014-10-22T19:48:31 [DEBUG] WebPage - updateLoadingProgress: 10 
2014-10-22T19:48:32 [DEBUG] Network - Resource request error: 6 ( "SSL handshake failed" ) URL: "https://github.com/" 
2014-10-22T19:48:32 [DEBUG] WebPage - updateLoadingProgress: 100 
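The same kind of failure can be surfaced from inside the script, without `--debug=true`, by assigning a `page.onResourceError` callback (available in PhantomJS 1.9 and later). The sketch below is an assumption about how one might wire it up, not part of the original script. `formatResourceError` is a hypothetical helper, and the `typeof phantom` guard just keeps the file loadable outside PhantomJS.

```javascript
// Hypothetical helper: format the object PhantomJS passes to
// page.onResourceError (it carries url, errorCode and errorString).
function formatResourceError(resourceError) {
  return 'Resource error ' + resourceError.errorCode +
    ' (' + resourceError.errorString + ') for ' + resourceError.url;
}

// Only run the PhantomJS part when the script is executed by phantomjs.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();

  // Called once for every resource that fails to load.
  page.onResourceError = function (resourceError) {
    console.log(formatResourceError(resourceError));
  };

  page.open('https://github.com/', function (status) {
    console.log('Load status: ' + status);
    phantom.exit();
  });
}
```

With this in place the script prints one line per failed request, which is usually easier to scan than the full debug log.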

So from this we can conclude that no screenshot was taken because GitHub was refusing the connection. Great, that makes perfect sense. So let's set the SSL flag to --ssl-protocol=any, and let's also ignore SSL errors with --ignore-ssl-errors=true:

phantomjs --ignore-ssl-errors=true --ssl-protocol=any --debug=true github.js

Great success! A screenshot is now being rendered and saved properly, but the debug output is showing us a TypeError:

TypeError: 'undefined' is not a function (evaluating 'Array.prototype.forEach.call.bind(Array.prototype.forEach)')

  https://assets-cdn.github.com/assets/frameworks-dabc650f8a51dffd1d4376a3522cbda5536e4807e01d2a86ff7e60d8d6ee3029.js:29
  https://assets-cdn.github.com/assets/frameworks-dabc650f8a51dffd1d4376a3522cbda5536e4807e01d2a86ff7e60d8d6ee3029.js:29
2014-10-22T19:52:32 [DEBUG] WebPage - updateLoadingProgress: 72 
2014-10-22T19:52:32 [DEBUG] WebPage - updateLoadingProgress: 88 
ReferenceError: Can't find variable: $

  https://assets-cdn.github.com/assets/github-fa2f009761e3bc4750ed00845b9717b09646361cbbc3fa473ad64de9ca6ccf5b.js:1
  https://assets-cdn.github.com/assets/github-fa2f009761e3bc4750ed00845b9717b09646361cbbc3fa473ad64de9ca6ccf5b.js:1

I checked the github homepage manually just to see if a TypeError existed and it does NOT.

My next guess is that the assets aren't loading quickly enough. PhantomJS is faster than a speeding bullet!

So let's try to slow it down artificially and see if we can get rid of that TypeError...

var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36';
page.open('http://github.com', function (status) {
    window.setTimeout(function () {
        page.render('github.png');
        phantom.exit();
    }, 3000);
});

That didn't work... After a closer inspection of the image, it is clear that some elements are missing, mainly some icons and the logo.

Success? Partially, because we are now at least getting a screenshot where earlier we weren't getting a thing.

Job done? Not exactly. I need to determine what is causing that TypeError, because it is preventing some assets from loading and distorting the image.

Additional

I attempted to recreate this with CasperJS. Its --debug output is very ugly and hard to follow compared to PhantomJS:

casper.start();
casper.userAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X)');
casper.thenOpen('https://www.github.com/', function() {
    this.captureSelector('github.png', 'body');
});

casper.run();

console:

casperjs test --ssl-protocol=any --debug=true github.js

Furthermore, the image is missing the same icons and is also visually distorted. Since CasperJS relies on PhantomJS, I do not see the value in using it for this specific task.

If you would like to add to my answer, please share your findings. I am very interested in a flawless PhantomJS solution.

Update #1 : Removing the TypeError

@ArtjomB points out that PhantomJS does not support JavaScript's bind in its current version as of this update (1.9.7). He explains this in his answer on the PhantomJS bind issue:

The TypeError: 'undefined' is not a function refers to bind, because PhantomJS 1.x doesn't support it. PhantomJS 1.x uses an old fork of QtWebkit which is comparable to Chrome 13 or Safari 5. The newer PhantomJS 2 will use a newer engine which will support bind. For now you need to add a shim inside of the page.onInitialized event handler:

OK great, so the following code will take care of our TypeError from above. (It is still not fully functional; see below for details.)

var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36';
page.open('http://github.com', function (status) {
    window.setTimeout(function () {
        page.render('github.png');
        phantom.exit();
    }, 5000);
});
page.onInitialized = function(){
    page.evaluate(function(){
        var isFunction = function(o) {
          return typeof o == 'function';
        };

        var bind,
          slice = [].slice,
          proto = Function.prototype,
          featureMap;

        featureMap = {
          'function-bind': 'bind'
        };

        function has(feature) {
          var prop = featureMap[feature];
          return isFunction(proto[prop]);
        }

        // check for missing features
        if (!has('function-bind')) {
          // adapted from Mozilla Developer Network example at
          // https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Function/bind
          bind = function bind(obj) {
            var args = slice.call(arguments, 1),
              self = this,
              nop = function() {
              },
              bound = function() {
                return self.apply(this instanceof nop ? this : (obj || {}), args.concat(slice.call(arguments)));
              };
            nop.prototype = this.prototype || {}; // Firefox cries sometimes if prototype is undefined
            bound.prototype = new nop();
            return bound;
          };
          proto.bind = bind;
        }
    });
};

Now the above code will get us the same screenshot as before, AND the debug output will not show a TypeError, so on the surface everything appears to work. Progress has been made.
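For readers wondering what the failing expression was doing: `Array.prototype.forEach.call.bind(Array.prototype.forEach)` uses `Function.prototype.bind` to turn `forEach` into a standalone function that takes the array-like as its first argument. The sketch below reproduces that pattern in plain JavaScript. The names `forEach`, `fakeNodeList` and `collectTags` are made up for illustration, and it needs an engine that has `bind` (natively, or through a shim like the one above).

```javascript
// Build a standalone forEach: forEach(list, callback) is equivalent to
// Array.prototype.forEach.call(list, callback).
// Without Function.prototype.bind, `call.bind` is undefined and this line
// throws the "'undefined' is not a function" TypeError quoted earlier.
var forEach = Array.prototype.forEach.call.bind(Array.prototype.forEach);

// Works on array-likes that have no forEach of their own; a plain object
// with indexed keys and a length stands in for one here.
var fakeNodeList = { 0: 'div', 1: 'span', length: 2 };

// Illustrative helper: upper-case every entry of an array-like.
function collectTags(list) {
  var tags = [];
  forEach(list, function (tag) {
    tags.push(tag.toUpperCase());
  });
  return tags;
}

console.log(collectTags(fakeNodeList).join(',')); // prints "DIV,SPAN"
```

Because the page's own scripts build helpers like this while they load, the shim has to be installed before they run, which is why it goes in `page.onInitialized`.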

Unfortunately, all of the image icons (logo, etc.) are still not loading correctly. We see some sort of "3W" icon; I'm not sure where that's from.

Thanks for the help @ArtjomB

  • You also have a bind issue. Here are the drop in solutions for [Casper](http://stackoverflow.com/questions/25359247/casperjs-bind-issue/25359714#25359714) and for [PhantomJS](http://stackoverflow.com/questions/26382041/phantomjs-page-content-isnt-retrieving-the-page-content/26383058#26383058). – Artjom B. Oct 23 '14 at 06:25
  • Hey thanks, I suspect the bind issue is HTTPS-related. Tested with setTimeout set to 10 seconds, with the same result... – MrPizzaFace Oct 23 '14 at 06:34
  • The code inside of `page.onInitialized` adds the `bind` shim so that you won't get a TypeError on the page and the page JS functions properly (if you further need to do something on the page). – Artjom B. Oct 23 '14 at 21:16
  • Those logos are icon fonts, and it looks like there is an issue with them. Refer to https://github.com/ariya/phantomjs/issues/10592 – Ravi Kadaboina Jan 22 '15 at 17:42
  • Thank you for such a great answer. For most websites I am getting a screenshot, but for some websites, for example http://www.practo.com and http://www.iitr.ac.in, I am still getting a white screenshot or a 403 error. Can you please help with that? – Ajeet Lakhani Apr 12 '15 at 08:52
  • @ajeetlakhani 403 is most probably due to web app recognizing a bot and not a "real user" browser - try changing the user agent to something "real". – cprn May 04 '16 at 12:18