I've been trying to figure this out for a couple days now but haven't been able to achieve it.
There's this web page were I need to scrap all records available on it, I've noticed that if I modify the pagination link with firebug or the browser's inspector I can get all the records I need, for example, this is the original link:
<a href="javascript:gReport.navigate.paginate('paginator_min_row=16max_rows=15rows_fetched=15')">
If I modify that link like this
<a href="javascript:gReport.navigate.paginate('paginator_min_row=1max_rows=5000rows_fetched=5000')">
And then click on the pagination button on the browser (the very same that contains the link I've just changed) I'm able to get all records I need from that site (most of the time "rows" doesn't get any bigger than 4000, I use 5000 just in case)
Since I have to process that file by hand every single day I thought that maybe I could automatize the process with PhantomJS and get the whole page on a single run without looking for that link then changing it, so in order to modify the pagination link and getting all records I'm using the following code:
var page = require('webpage').create();
var fs = require('fs');
page.open('http://testingsite1.local', function () {
page.evaluate(function(){
$('a[href="javascript:gReport.navigate.paginate(\'paginator_min_row=16max_rows=15rows_fetched=15\')"]').first().attr('href', 'javascript:gReport.navigate.paginate(\'paginator_min_row=1max_rows=5000rows_fetched=5000\')').attr('id','clickit');
$('#clickit')[0].click();
});
page.render('test.png');
fs.write('test.html', page.content, 'w');
phantom.exit();
});
Notice that there are TWO pagination links on that website, because of that I'm using jquery's ".first()" to choose only the first one.
Also since the required link doesn't have any identificator I select it using its own link then change it to what I need, and lastly I add the "clickit" ID to it for later calling.
Now, this are my questions:
I'm, not exactly sure why it isn't working, if I run the code it fetches the first page only, after examining the requested page source code I do see the href link has been changed to what I want but it just doesn't get called, I have two different theories on what might be wrong
The modified href isn't getting "clicked" so the page isn't getting updated
The href does get clicked, but since the page takes a few seconds to load all results dynamically I only get to dump the first page Phantomjs gets to see
What do you guys think about it?
[UPDATE NOV 6 2015] Ok, so the answers provided by @Artjomb and @pguardiario pointed me in a new direction:
- I needed more debugging info on what was going on
- I needed to call gReport.navigate.paginate function directly
Sadly I simply lack the the experience to properly use PhantomJS, several other samples indicated that I could achieve what I wanted with CasperJS, so I tried it, this is what I produced after a couple of hours
var utils = require('utils');
var fs = require('fs');
var url = 'http://testingsite1.local';
var casper = require('casper').create({
verbose: true,
logLevel: 'debug'
});
casper.on('error', function(msg, backtrace) {
this.echo("=========================");
this.echo("ERROR:");
this.echo(msg);
this.echo(backtrace);
this.echo("=========================");
});
casper.on("page.error", function(msg, backtrace) {
this.echo("=========================");
this.echo("PAGE.ERROR:");
this.echo(msg);
this.echo(backtrace);
this.echo("=========================");
});
casper.start(url, function() {
var url = this.evaluate(function() {
$('a[href="javascript:gReport.navigate.paginate(\'paginator_min_row=16max_rows=15rows_fetched=15\')"]').attr('href', 'javascript:gReport.navigate.paginate(\'paginator_min_row=1max_rows=5000rows_fetched=5000\')').attr('id', 'clicklink');
return gReport.navigate.paginate('paginator_min_row=1max_rows=5000rows_fetched=5000');
});
});
casper.then(function() {
this.waitForSelector('.nonexistant', function() {
// Nothing here
}, function() {
//page load failed after 5 seconds
this.capture('screen.png');
var html = this.getPageContent();
var f = fs.open('test.html', 'w');
f.write(html);
f.close();
}, 50000);
});
casper.run(function() {
this.exit();
});
Please be gentle as I know this code sucks, I'm no Javascript expert and in fact I know very little of it, I know I should have waited an element to appear but it simply didn't work on my tests as I was still getting the page without update from the AJAX request.
In the end I waited a long time (50 seconds) for the AJAX request to show on page and then dump the HTML
Oh! and calling the function directly did work great!