I am trying to scrape a few sites. Here is my code:

for (var i = 0; i < urls.length; i++) {
    url = urls[i];
    console.log("Start scraping: " + url);

    page.open(url, function () {
        waitFor(function() {
            return page.evaluate(function() {
                return document.getElementById("progressWrapper").childNodes.length == 1;
            });

        }, function() {
            var price = page.evaluate(function() {
                // do something
                return price;
            });

            console.log(price);
            result = url + " ; " + price;
            output = output + "\r\n" + result;
        });
    });

}
fs.write('test.txt', output);
phantom.exit();

I want to scrape all sites in the array urls, extract some information and then write this information to a text file.

But there seems to be a problem with the for loop. When I scrape only one site without a loop, everything works as I want. With the loop, nothing happens at first, then the line

console.log("Start scraping: " + url);

is shown, but one time too many. If urls = [a, b, c], then PhantomJS prints:

Start scraping: a 
Start scraping: b 
Start scraping: c 
Start scraping:

It seems that page.open isn't called at all. I am new to JS, so I am sorry if this is a stupid question.

ORspecialist
  • Try adding `console.log("Scraping " + urls.length + " pages.")` before the scraping, and consider wrapping the URL in quotes when you log it. You may have empty (or whitespace-only) entries in the list. – ssube Oct 31 '14 at 19:09

1 Answer

PhantomJS is asynchronous: page.open() returns immediately and its callback runs later, once the page has loaded. By calling page.open() repeatedly in a loop, you overwrite the pending request with a new one before it has finished, and that one is overwritten in turn. In your script, fs.write() and phantom.exit() also run right after the loop, before any callback has fired. You need to open the pages one after the other, for example like this:

page.open(url, function () {
    waitFor(function() {
       // something
    }, function() {
        page.open(url, function () {
            waitFor(function() {
               // something
            }, function() {
                // and so on
            });
        });
    });
});

But this is tedious. Utilities such as async.js help you write nicer code. You can install it through npm in the directory of the PhantomJS script.

var async = require("async"); // install async through npm
var tests = urls.map(function(url){
    return function(callback){
        page.open(url, function () {
            waitFor(function() {
               // something
            }, function() {
                callback();
            });
        });
    };
});
async.series(tests, function finish(){
    fs.write('test.txt', output);
    phantom.exit();
});
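
To show what the series call does with the tasks array, here is a minimal stand-in written from scratch. It is a sketch, not async.js's actual code: runSeries and fakeOpen are hypothetical names, fakeOpen replaces page.open() plus waitFor() with a timer so the sketch runs without PhantomJS, and error and result passing (which the real library supports) is left out.

```javascript
// Minimal stand-in for a "run tasks one after the other" helper.
// Each task is a function(callback); the next task starts only after
// the previous one has called its callback.
function runSeries(tasks, finish) {
    var index = 0;
    function next() {
        if (index >= tasks.length) {
            finish();
            return;
        }
        var task = tasks[index++];
        task(next);
    }
    next();
}

// Hypothetical replacement for page.open + waitFor: pretends to load
// a URL and calls back a little later.
var log = [];
function fakeOpen(url, callback) {
    log.push("open " + url);
    setTimeout(function () {
        log.push("done " + url);
        callback();
    }, 10);
}

// Same shape as the tests array above: one task function per URL.
var tasks = ["a", "b", "c"].map(function (url) {
    return function (callback) {
        fakeOpen(url, callback);
    };
});

runSeries(tasks, function finish() {
    log.push("finished");
    console.log(log.join("\n"));
});
```

Each "open" entry is logged only after the previous "done", which is the ordering the plain for loop fails to give you.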

If you don't want any dependencies, then it is also easy to define your own recursive function (from here):

var urls = [/*....*/];

function handle_page(url){
    page.open(url, function(){
        waitFor(function() {
           // something
        }, function() {
            next_page();
        });
    });
}

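function next_page(){
    var url = urls.shift();
    if(!url){
        // urls is empty: shift() returned undefined, so we are done
        phantom.exit(0);
        return; // phantom.exit() does not stop the script immediately
    }
    handle_page(url);
}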

next_page();
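
The recursive version above does not yet collect the prices or write the file from the question. One way to add that is sketched below, under some assumptions: scrapeAll, scrapeOne and fakeScrape are hypothetical names, scrapeOne(url, callback) stands for your own page.open() plus waitFor() plus page.evaluate() code, and the stub with made-up prices is only there so the sketch runs without PhantomJS.

```javascript
// Visit each URL in turn and collect one "url ; price" line per URL.
// scrapeOne(url, callback) is a hypothetical helper: in PhantomJS it would
// wrap page.open() + waitFor() and pass the extracted price to callback.
function scrapeAll(urls, scrapeOne, done) {
    var queue = urls.slice();   // copy, so the caller's array is left untouched
    var lines = [];

    function nextPage() {
        if (queue.length === 0) {
            // Nothing left to visit: hand the collected text to the caller.
            done(lines.join("\r\n"));
            return;
        }
        var url = queue.shift();
        scrapeOne(url, function (price) {
            lines.push(url + " ; " + price);
            nextPage();         // start the next page only now
        });
    }

    nextPage();
}

// Stub so the sketch runs outside PhantomJS; the prices are made up.
var fakePrices = { a: "10", b: "20", c: "30" };
function fakeScrape(url, callback) {
    setTimeout(function () { callback(fakePrices[url]); }, 10);
}

var finalOutput = null;
scrapeAll(["a", "b", "c"], fakeScrape, function (output) {
    finalOutput = output;
    console.log(output);
    // In the PhantomJS script, this is where the question's
    // fs.write('test.txt', output) and phantom.exit() would go.
});
```

Because url is a local variable inside nextPage, each result line is paired with the URL it was scraped from, and the file is written only once every page has been handled.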
Artjom B.