I've been trying to figure this out for a couple of days now, without success.

There's a web page where I need to scrape all the available records. I've noticed that if I modify the pagination link with Firebug or the browser's inspector, I can get all the records I need. For example, this is the original link:

<a href="javascript:gReport.navigate.paginate('paginator_min_row=16max_rows=15rows_fetched=15')">

If I modify that link like this:

<a href="javascript:gReport.navigate.paginate('paginator_min_row=1max_rows=5000rows_fetched=5000')">

and then click the pagination button in the browser (the very same one that contains the link I've just changed), I get all the records I need from that site. Most of the time the row count doesn't go above 4000; I use 5000 just in case.

Since I have to process that file by hand every single day, I thought I could automate the process with PhantomJS and get the whole page in a single run, without looking for that link and changing it by hand. To modify the pagination link and get all the records, I'm using the following code:

var page = require('webpage').create();
var fs = require('fs');
page.open('http://testingsite1.local', function () {
    page.evaluate(function(){
        $('a[href="javascript:gReport.navigate.paginate(\'paginator_min_row=16max_rows=15rows_fetched=15\')"]').first().attr('href', 'javascript:gReport.navigate.paginate(\'paginator_min_row=1max_rows=5000rows_fetched=5000\')').attr('id','clickit');
        $('#clickit')[0].click();
    });

    page.render('test.png');
    fs.write('test.html', page.content, 'w');
    phantom.exit();
});

Notice that there are TWO pagination links on that page; that's why I'm using jQuery's ".first()" to choose only the first one.

Also, since the required link doesn't have any identifier, I select it by its own href, change it to what I need, and finally add the "clickit" ID to it so I can call it later.

Now, these are my questions:

I'm not exactly sure why it isn't working. If I run the code, it fetches only the first page. After examining the source of the dumped page, I can see the href has been changed to what I want, but it just doesn't get called. I have two theories about what might be wrong:

  1. The modified href isn't getting "clicked" so the page isn't getting updated

  2. The href does get clicked, but since the page takes a few seconds to load all results dynamically, I only get to dump the first page PhantomJS sees

What do you think?


[UPDATE NOV 6 2015] Ok, so the answers provided by @Artjomb and @pguardiario pointed me in a new direction:

  1. I needed more debugging info on what was going on
  2. I needed to call the gReport.navigate.paginate function directly

Sadly, I simply lack the experience to use PhantomJS properly. Several other samples indicated that I could achieve what I wanted with CasperJS, so I tried it. This is what I produced after a couple of hours:

var utils = require('utils');
var fs = require('fs');
var url = 'http://testingsite1.local';

var casper = require('casper').create({
  verbose: true,
  logLevel: 'debug'
});

casper.on('error', function(msg, backtrace) {
  this.echo("=========================");
  this.echo("ERROR:");
  this.echo(msg);
  this.echo(backtrace);
  this.echo("=========================");
});

casper.on("page.error", function(msg, backtrace) {
  this.echo("=========================");
  this.echo("PAGE.ERROR:");
  this.echo(msg);
  this.echo(backtrace);
  this.echo("=========================");
});

casper.start(url, function() {
  var url = this.evaluate(function() {
    $('a[href="javascript:gReport.navigate.paginate(\'paginator_min_row=16max_rows=15rows_fetched=15\')"]').attr('href', 'javascript:gReport.navigate.paginate(\'paginator_min_row=1max_rows=5000rows_fetched=5000\')').attr('id', 'clicklink');
    return gReport.navigate.paginate('paginator_min_row=1max_rows=5000rows_fetched=5000');
  });
});

casper.then(function() {
  this.waitForSelector('.nonexistant', function() {
    // Nothing here
  }, function() {
    // Timeout callback: reached after 50 seconds, since '.nonexistant' never matches
    this.capture('screen.png');
    var html = this.getPageContent();
    var f = fs.open('test.html', 'w');
    f.write(html);
    f.close();
  }, 50000);
});

casper.run(function() {
  this.exit();
});

Please be gentle, as I know this code sucks. I'm no JavaScript expert; in fact, I know very little of it. I know I should have waited for an element to appear, but that simply didn't work in my tests: I was still getting the page without the update from the AJAX request.

In the end I waited a long time (50 seconds) for the AJAX response to show up on the page and then dumped the HTML.

Oh, and calling the function directly worked great!
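
The direct call can be wrapped in a small helper. This is only a minimal sketch: it assumes the page exposes gReport.navigate.paginate as shown above, the names buildPaginateArg and paginateAll are made up for illustration, and the argument string is concatenated exactly as it appears in the original link. It sticks to ES5 so it should also run under PhantomJS or CasperJS.

```javascript
// Hypothetical helper: builds the argument string in the same
// separator-less format seen in the page's original pagination link.
function buildPaginateArg(minRow, maxRows) {
  return 'paginator_min_row=' + minRow +
         'max_rows=' + maxRows +
         'rows_fetched=' + maxRows;
}

// Hypothetical helper: `page` is anything with an evaluate(fn, arg)
// method, such as a PhantomJS webpage or the CasperJS instance.
function paginateAll(page, maxRows) {
  var arg = buildPaginateArg(1, maxRows);
  // The callback runs inside the page, so it only sees what is passed in.
  page.evaluate(function (paginateArg) {
    gReport.navigate.paginate(paginateArg);
  }, arg);
  return arg;
}
```

Inside casper.start this would be called as paginateAll(this, 5000), which avoids editing the href at all.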

dvisor
  • Doesn't it make more sense to do `gReport.navigate.paginate('paginator_min_row=1max_rows=5000rows_fetched=5000')` directly? – pguardiario Nov 06 '15 at 00:56
  • @pguardiario That's a great idea! In fact, you have pointed me in a new direction: I've now added **CasperJS** on top and I'm having a certain degree of success. I'll update the main question with it – dvisor Nov 07 '15 at 03:59

1 Answer

  2. The href does get clicked, but since the page takes a few seconds to load all results dynamically, I only get to dump the first page PhantomJS sees

It's easy to check whether that's the case by wrapping the render, write and exit calls in setTimeout and trying different timeouts:

var page = require('webpage').create();
var fs = require('fs');

page.open('http://testingsite1.local', function () {
    page.evaluate(function(){
        $('a[href="javascript:gReport.navigate.paginate(\'paginator_min_row=16max_rows=15rows_fetched=15\')"]').first().attr('href', 'javascript:gReport.navigate.paginate(\'paginator_min_row=1max_rows=5000rows_fetched=5000\')').attr('id','clickit');
        $('#clickit')[0].click();
    });

    setTimeout(function(){
        page.render('test.png');
        fs.write('test.html', page.content, 'w');
        phantom.exit();
    }, 5000);
});

If it's really just a timing issue, then you should use a waitFor() function (such as the one in the PhantomJS examples) to wait for a specific condition like "all elements loaded" or "x elements of that type are loaded".
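
A polling helper of this kind could look roughly like the sketch below. It is loosely inspired by the waitfor.js example that ships with PhantomJS but is not that file's code; the onTimeout parameter and the 250 ms polling interval are my own assumptions.

```javascript
// Polls testFx every 250 ms until it returns a truthy value or
// timeoutMs has elapsed, then calls onReady or onTimeout exactly once.
function waitFor(testFx, onReady, onTimeout, timeoutMs) {
  var start = Date.now();
  var timer = setInterval(function () {
    if (testFx()) {
      clearInterval(timer);
      onReady();
    } else if (Date.now() - start >= timeoutMs) {
      clearInterval(timer);
      onTimeout();
    }
  }, 250);
}
```

In the script above, testFx would be a page.evaluate call that counts the record rows (using whatever selector matches a row on that site), and the render, write and exit calls would move into onReady.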

  1. The modified href isn't getting "clicked" so the page isn't getting updated

This is a little trickier. You can listen to the onConsoleMessage, onError, onResourceError and onResourceTimeout events and check whether there are errors on the page. Some of those errors can be worked around from within PhantomJS, for example Function.prototype.bind not being available, or an HTTPS site or its resources failing to load.
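
As a rough sketch of what registering those handlers might look like: attachDebugHandlers and its log parameter are hypothetical names, while the four callback properties are the PhantomJS webpage callbacks named above.

```javascript
// Attaches logging handlers to a PhantomJS webpage object.
// `log` is a hypothetical callback, e.g. console.log.
function attachDebugHandlers(page, log) {
  page.onConsoleMessage = function (msg) {
    log('CONSOLE: ' + msg);
  };
  page.onError = function (msg, trace) {
    log('PAGE ERROR: ' + msg);
    (trace || []).forEach(function (frame) {
      log('  at ' + frame.file + ':' + frame.line);
    });
  };
  page.onResourceError = function (resourceError) {
    log('RESOURCE ERROR: ' + resourceError.url +
        ' (' + resourceError.errorString + ')');
  };
  page.onResourceTimeout = function (request) {
    log('RESOURCE TIMEOUT: ' + request.url);
  };
  return page;
}
```

It has to be called before page.open so that errors during the initial load are captured. As far as I know, onResourceTimeout only fires when page.settings.resourceTimeout has been set.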

There are other ways to click something that are more reliable such as this one.
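
One common alternative, which is not necessarily the one linked above, is to dispatch a synthetic mouse event instead of calling click() on the element. A minimal sketch, with syntheticClick as a made-up name:

```javascript
// Dispatches a synthetic mouse click on `el`. `doc` is the element's
// document; pass `document` when running inside page.evaluate().
function syntheticClick(doc, el) {
  var ev = doc.createEvent('MouseEvents');
  // Arguments: type, bubbles, cancelable, view, detail, screenX, screenY,
  // clientX, clientY, ctrlKey, altKey, shiftKey, metaKey, button, relatedTarget
  ev.initMouseEvent('click', true, true, doc.defaultView, 1,
                    0, 0, 0, 0, false, false, false, false, 0, null);
  return el.dispatchEvent(ev);
}
```

Because page.evaluate callbacks run in the page's own context, the function has to be defined inside the callback (or injected into the page) rather than referenced from the outer script.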

Artjom B.