3

I was recently building a scraper module to get some information with nodejs until I encountered this "little" problem. The modules that I'm using are cheeriojs and request. Actually the module works like a charm if I call only one method at a time. It contains three function and only two of them are exported, this is the code:

'use strict';

var request = require('request'),
    cheerio = require('cheerio'),
    counter = 0;

function find(term, cat, callback) {
  // All the check for the parameters
  scrape("http://.../search.php?search=" + encodeURIComponent(term), cat, callback);
}

function last(cat, callback) {
  // All the check for the parameters
  scrape("http://google.com/", cat, callback);
}

function scrape(url, cat, callback) {
  request(url, function (error, response, body) {
    if (!error && response.statusCode == 200) {
      var $ = cheerio.load(body);
      var result = [];

      var items = $('.foo, .foo2').filter(function() {
        // Condition to filter the resulted items
      });

      items.each(function(i, row) {
        // Had to do another request inside here to scrape other information
        request( $(".newpagelink").attr("href"), function(error, response, body) {
          var name = $(".selector").text(),
              surname = $(".selector2").text(),
              link = cheerio.load(body)('.magnet').attr('href'); // This is the only thing that I'm scraping from the new page, the rest comes from the other "cheerio.load"
        // Push an object in the array
        result.push( { "name": name, "surname": surname, "link": link } );

          // To check when the async requests are ended
          counter++;
          if(counter == items.length-1) {
            callback(null, result);
          }
        });
      });
    }
  });
}

exports.find = find;
exports.last = last;

The problem now, as I was saying, is that if I create a new node script "test.js" and I call only last OR find, it works perfectly! But if I call both the methods consecutively like this:

var mod = require("../index-tmp.js");
mod.find("bla", "blabla", function(err, data) {
    if (err) throw err;
    console.log(data.length + " find");
});
mod.last(function(err, data) {
  console.log(data.length + " last");
});

The results are completely messed up, sometimes the script doesn't even print something, other times print the result of only "find" or "last", and other times returns a cheeriojs error (I won't add here to not mess you up, because probably it's my script's fault). I thought also to repeat the same function two times for both the methods but nothing, the same problems occur... I don't know what else to try, I hope you'll tell me the cause of this behavior!

Giacomo Cerquone
  • 2,352
  • 5
  • 24
  • 33
  • Btw, `counter == items.length-1` looks suspicious as well. Why not wait for all items? Also you're missing error handling in the inner `request` callback, and the `callback` will never be called if `items.length == 0` – Bergi Dec 18 '15 at 10:08
  • @Bergi items.length-1 because the last one is an empty box that I don't have to scrape. The error handling, as you can see also from the other functions, are not included in this post to not make it too long, I just forgot the remove also the first one. Anyway it works putting the counter inside... I wasted 2 hours of my life on this problem .-. Thank you very much – Giacomo Cerquone Dec 18 '15 at 10:13
  • 2
    From the code you've shown to us, you still are sending a `request` for the last box… And notice that you cannot be sure which of them is ignored because arriving last. – Bergi Dec 18 '15 at 10:16
  • @Bergi Yes, I already figured out that I cannot know which will be the last item, anyway I wanted to solve this problem first, and now I can fix the thing of the last item, btw should not be hard, I've just to analyze the page more and search for some singularity in the nested html tags in order to not scrape it – Giacomo Cerquone Dec 18 '15 at 10:19

2 Answers2

3

Your counter variable is global, not specific to each scrape call. It wouldn't work if you called find twice at the same time either, or last.

Move the declaration and initialisation of var counter = 0; into the scrape function, or even better right next to the result and items declarations.

Bergi
  • 630,263
  • 148
  • 957
  • 1,375
0

From scanning your code quickly, this is probably due to the variable counter being global. These are asynchronous functions, so they will both act on counter at the same thing. Move the declaration inside of the scrape function.

If you need more information about asynchronous programming, refer to Felix's great answer in this question.

Community
  • 1
  • 1
hargasinski
  • 841
  • 5
  • 14