20

I see that CasperJS has a "download" function and an "on resource received" callback but I do not see the contents of a resource in the callback, and I don't want to download the resource to the filesystem.

I want to grab the contents of the resource so that I can do something with it in my script. Is this possible with CasperJS or PhantomJS?

Artjom B.
iwek

4 Answers

17

This problem has been in my way for the last couple of days. The proxy solution wasn't very clean in my environment, so I tracked down where PhantomJS's QtNetwork core puts resources when it caches them.

Long story short, here is my gist. You need the cache.js and mimetype.js files: https://gist.github.com/bshamric/4717583

//for this to work, you have to call phantomjs with the cache enabled:
//usage:  phantomjs --disk-cache=true test.js

var page = require('webpage').create();
var fs = require('fs');
var cache = require('./cache');
var mimetype = require('./mimetype');

//this is the path that the QtNetwork classes use for caching files for their http client
//the path should be the one that has 16 folders labeled 0,1,2,3,...,F
cache.cachePath = '/Users/brandon/Library/Caches/Ofi Labs/PhantomJS/data7/';

var url = 'http://google.com';
page.viewportSize = { width: 1300, height: 768 };

//when the resource is received, go ahead and include a reference to it in the cache object
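page.onResourceReceived = function(response) {
    //I only cache images, but you can change this
    //contentType can be missing on some responses, so check it before calling indexOf
    if (response.contentType && response.contentType.indexOf('image') >= 0) {
        cache.includeResource(response);
    }
};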

//when the page is done loading, go through each cachedResource and do something with it, 
//I'm just saving them to a file
page.onLoadFinished = function(status) {
    //declare index with var so it does not leak into the global scope
    for (var index in cache.cachedResources) {
        var resource = cache.cachedResources[index];
        var file = resource.cacheFileNoPath;
        var ext = mimetype.ext[resource.mimetype];
        var finalFile = file.replace("." + cache.cacheExtension, "." + ext);
        fs.write('saved/' + finalFile, resource.getContents(), 'b');
    }
};

page.open(url, function () {
    page.render('saved/google.pdf');
    phantom.exit();
});

Then when you call phantomjs, just make sure the cache is enabled:

phantomjs --disk-cache=true test.js

Some notes: I wrote this to get the images on a page without using a proxy or taking a low-res snapshot. Qt compresses certain text resources in its cache, so you will have to handle the decompression yourself if you use this for text files. Also, when I ran a quick test pulling in HTML resources, it didn't strip the HTTP headers from the result. Still, this is useful to me and hopefully to someone else; modify it if you have problems with a specific content type.
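For the decompression mentioned in the notes above, here is a minimal Python 3 sketch. It assumes the compressed payload is in the format Qt documents for qCompress/qUncompress (a 4-byte big-endian length of the uncompressed data, followed by a zlib stream) and that you have already isolated that payload from whatever metadata the cache file stores before it. Both points are assumptions about the cache layout, not something the gist confirms, and the helper names `q_uncompress` and `q_compress` are made up for illustration.

```python
import struct
import zlib


def q_uncompress(blob: bytes) -> bytes:
    """Decode a byte string laid out like Qt's qCompress output.

    Assumed layout: 4-byte big-endian uncompressed length + zlib stream.
    """
    if len(blob) < 4:
        raise ValueError("too short to contain the 4-byte length prefix")
    (expected_len,) = struct.unpack(">I", blob[:4])
    data = zlib.decompress(blob[4:])
    if len(data) != expected_len:
        raise ValueError("length prefix does not match decompressed size")
    return data


def q_compress(data: bytes) -> bytes:
    """Build the same layout; only used here to produce a test payload."""
    return struct.pack(">I", len(data)) + zlib.compress(data)


# Round trip with a stand-in for a cached text resource
css = b"body { margin: 0; }\n" * 10
payload = q_compress(css)
restored = q_uncompress(payload)
print(restored == css)
```

If the length check fails on real cache files, the payload boundaries are probably not where this sketch assumes, and the file layout would need to be inspected first.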

brandon
  • how do you decompress? – KJW Oct 15 '13 at 23:37
  • would really like to find out how you decompressed. Did you manage it ? – Vic Seedoubleyew Feb 28 '15 at 14:43
  • You sir, are a trooper. Thanks for this. – Authman Apatira May 14 '15 at 03:12
  • not working anymore, phantomjs using sqlite for cache – jmp Mar 11 '16 at 13:09
  • Looks great, but in my case i have a dynamic page generated by many calls to the same url with different POST parameters: once it returns the html container, then a PDF file, then an image, ... cache.js getUrlCacheFilename() seems to return always the same cache filename (7/3kjh55ig.d) - that actually doesn't exist – j.c Apr 24 '17 at 10:22
16

Judging by issue 158 (http://code.google.com/p/phantomjs/issues/detail?id=158), this is a bit of a headache for the PhantomJS developers, so I wouldn't expect it to be supported until PhantomJS matures a bit.

So you want to do it anyway? I opted to go one level higher: I grabbed PyMiProxy from https://github.com/allfro/pymiproxy, installed it, set it up, took their example code, and turned it into this proxy.py:

from miproxy.proxy import RequestInterceptorPlugin, ResponseInterceptorPlugin, AsyncMitmProxy
from mimetools import Message
from StringIO import StringIO

class DebugInterceptor(RequestInterceptorPlugin, ResponseInterceptorPlugin):

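        def do_request(self, data):
            # Ask the server not to gzip responses so the body can be saved as-is
            data = data.replace('Accept-Encoding: gzip\r\n', 'Accept-Encoding:\r\n', 1)
            return data

        def do_response(self, data):
            #print '<< %s' % repr(data[:100])
            # The first line is the status line; the rest is headers followed by the body
            status_line, headers_alone = data.split('\r\n', 1)
            headers = Message(StringIO(headers_alone))
            # Use get() so a response without a Content-Type header does not raise KeyError
            content_type = headers.get('content-type', '')
            print "Content type: %s" % content_type
            if content_type == 'text/x-comma-separated-values':
                # Binary mode keeps the bytes unmodified; "with" closes the file
                with open('data.csv', 'wb') as f:
                    f.write(data)
            print ''
            return data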

if __name__ == '__main__':
    proxy = AsyncMitmProxy()
    proxy.register_interceptor(DebugInterceptor)
    try:
        proxy.serve_forever()
    except KeyboardInterrupt:
        proxy.server_close()

Then I fire it up

python proxy.py

Next I execute phantomjs with the proxy specified...

phantomjs --ignore-ssl-errors=yes --cookies-file=cookies.txt --proxy=127.0.0.1:8080 --web-security=no myfile.js

You may want to turn the security options back on; I didn't need them since I'm only scraping one source. You should now see a bunch of text flowing through your proxy console, and if a response has the MIME type "text/x-comma-separated-values" it will be saved as data.csv. The file will contain the raw response, headers included, but if you've come this far I'm sure you can figure out how to strip those off.
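Stripping the headers off the saved response could look roughly like the sketch below. It is written for Python 3 (the proxy script above is Python 2) and assumes the whole response arrives in a single `data` buffer and is not sent with chunked transfer encoding. The function name `split_http_response` and the sample CSV content are made up for illustration.

```python
def split_http_response(raw: bytes):
    """Split a raw HTTP response into (status_line, headers, body).

    Headers end at the first blank line (CRLF CRLF); everything after it
    is treated as the body and returned untouched.
    """
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        # No blank line found: there is no body in this buffer
        body = b""
    lines = head.split(b"\r\n")
    status_line = lines[0].decode("iso-8859-1")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        key = name.strip().lower().decode("iso-8859-1")
        headers[key] = value.strip().decode("iso-8859-1")
    return status_line, headers, body


# Hypothetical response, shaped like what the proxy would see
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/x-comma-separated-values\r\n"
       b"\r\n"
       b"date,kwh\r\n2013-01-01,12.5\r\n")
status, headers, body = split_http_response(raw)
print(status)
print(headers["content-type"])
print(body)
```

Writing only `body` to data.csv, instead of the full `data` buffer, would leave just the CSV content in the file.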

One other detail: I had to disable gzip encoding. I could use zlib to decompress gzipped data coming from my own Apache web server, but responses from IIS and the like produced decompression errors, and I'm not sure what causes that.
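If you would rather leave compression on, a Python 3 sketch of decoding the body by its Content-Encoding header is shown below. The cause of the IIS errors described above is not known. Raw deflate streams without a zlib header are one commonly reported source of such failures, and chunked transfer encoding is another, so the fallback here is a guess rather than a confirmed fix. The function name `decode_body` is made up for illustration.

```python
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Decompress an HTTP body according to its Content-Encoding value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    if encoding == "deflate":
        try:
            # Standard case: deflate data wrapped in a zlib header
            return zlib.decompress(body)
        except zlib.error:
            # Fallback: raw deflate stream with no zlib header
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # Unknown or missing encoding: return the bytes unchanged
    return body


original = b"date,kwh\n2013-01-01,12.5\n"

gzipped = gzip.compress(original)
print(decode_body(gzipped, "gzip") == original)

# Build a raw deflate stream (no zlib header) to exercise the fallback
compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw_deflate = compressor.compress(original) + compressor.flush()
print(decode_body(raw_deflate, "deflate") == original)
```

This only handles a complete body. A chunked response would have to be reassembled before it is passed to `decode_body`.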

So my power company won't offer me an API? Fine! We do it the hard way!

Xedecimal
2

Did not realize I could grab the source from the document object like this:

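casper.start(url, function() {
    // evaluate() can only pass serializable values back, so return the markup as a string
    var html = this.evaluate(function() {
        return document.documentElement.outerHTML;
    });
    this.echo(html);
});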

More info here.

iwek
1

You can use Casper.debugHTML() to print out the contents of an HTML resource:

var casper = require('casper').create();

casper.start('http://google.com/', function() {
    this.debugHTML();
});

casper.run();

You can also store the HTML contents in a variable using casper.getPageContent(): http://casperjs.org/api.html#casper.getPageContent (available in the latest master)

NiKo
  • Thanks NiKo, and I guess I wasn't clear, but I'm looking for all the other resources, not the html page. I want to store the external css or js file in a var, the contents of these resources, is this possible? – iwek Jul 18 '12 at 12:33
  • just make sure you set the protocol right (ie http vs https).. it took me a while to figure out the site i was trying to open was redirecting from http to https.. and that choked casperjs (bug?) – abbood Apr 09 '13 at 15:25
  • @iwek See this link to know more about how to save the resource to disk: http://stackoverflow.com/questions/24582307/how-to-save-the-current-webpage-with-casperjs-phantomjs as answered by http://stackoverflow.com/users/1816580/artjom-b – iChux Jan 08 '15 at 08:05