-2

I created a simple scraper using cheerio and the request client, but it doesn't work the way I want.

In the terminal I first see all the "null returned, do nothing" messages and only then the names, so it seems to handle all the URLs that return null first and the non-null ones afterwards.

I want it to run in order, from 1 to 100.

app.get('/back', function (req, res) {
  for (var y = 1; y < 100; y++) {
    (function () {
      var url = "example.com/person/" + y;
      var options2 = {
        url: url,
        headers: {
          'User-Agent': req.headers['user-agent'],
          'Content-Type': 'application/json; charset=utf-8'
        }
      };
      request(options2, function (err, resp, body) {
        if (err) {
          console.log(err);
        } else {
          if ($ = cheerio.load(body)) {
            var links = $('#container');
            var name = links.find('span[itemprop="name"]').html(); // name
            if (name == null) {
              console.log("null returned, do nothing");
            } else {
              name = entities.decodeHTML(name);
              console.log(name);
            }
          }
          else {
            console.log("can't open");
          }
        }
      });
    }());
  }
});
salep
  • What is "the right order"? –  Nov 19 '15 at 22:54
  • @Houseman From 1 to 100. – salep Nov 19 '15 at 22:55
  • Your loop won't wait for the first request to return before the second one fires. Javascript is asynchronous like that. There are lots of techniques you can search to make it wait. [For example](https://zackehh.com/handling-synchronous-asynchronous-loops-javascriptnode-js/). Or you can use a library like [Q](https://github.com/kriskowal/q). [Also, this](http://stackoverflow.com/questions/15162049/javascript-synchronizing-foreach-loop-with-callbacks-inside) –  Nov 19 '15 at 23:02
  • You apparently left out the one part of this code that makes each request do something different: you are not using `id` or `y` in your request, so this just does the exact same thing 99 times. Please include that part so we can properly suggest an alternative. You also have to tell us how you want it to behave. Do you want all 99 requests sent in parallel, as long as you get all 99 results back in order? Or do you want it to send one request, wait for that response, then send the next, and so on? – jfriend00 Nov 19 '15 at 23:09
  • I edited the question details, @jfriend00. I want 10 parallel requests. Each request returns a different value (null or a unique name). I set maxSockets to 10, by the way. – salep Nov 19 '15 at 23:27
  • 10 parallel requests? Your code is trying to do 99. – jfriend00 Nov 19 '15 at 23:51

1 Answer

3

If you are not using promises and you want to run the requests sequentially, then this is a common design pattern for running a sequential async loop:

app.get('/back', function (req, res) {
    var cntr = 1;

    function next() {
        if (cntr < 100) {
            var url = "example.com/person/" + cntr++;
            var options2 = {
                url: url,
                headers: {
                    'User-Agent': req.headers['user-agent'],
                    'Content-Type': 'application/json; charset=utf-8'
                }
            };
            request(options2, function (err, resp, body) {
                if (err) {
                    console.log(err);
                } else {
                    // declare $ locally instead of creating an implicit global
                    var $ = cheerio.load(body);
                    if ($) {
                        var links = $('#container');
                        var name = links.find('span[itemprop="name"]').html(); // name
                        if (name == null) {
                            console.log("null returned, do nothing");
                        } else {
                            name = entities.decodeHTML(name);
                            console.log(name);
                        }
                    } else {
                        console.log("can't open");
                    }
                    // do the next iteration
                    next();
                }
            });
        }
    }
    // start the first iteration
    next();
});
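On a Node.js version that supports `async`/`await`, the same one-at-a-time loop can probably be written without the recursive `next()` helper. The following is only a minimal sketch and not part of the original answer: `fakeRequest`, `fakeRequestPromise` and `runSequentially` are hypothetical names, and `fakeRequest` merely simulates `request()` with a random delay so the snippet runs without network access.

```javascript
// Hypothetical stand-in for request(): calls back after a random delay,
// with null for even ids to mimic pages that have no name.
function fakeRequest(id, callback) {
    var delay = Math.floor(Math.random() * 20);
    setTimeout(function () {
        callback(null, id % 2 === 0 ? null : 'person ' + id);
    }, delay);
}

// Wrap the callback API in a promise so it can be awaited.
function fakeRequestPromise(id) {
    return new Promise(function (resolve, reject) {
        fakeRequest(id, function (err, name) {
            if (err) return reject(err);
            resolve(name);
        });
    });
}

// Run the requests one at a time: the next request is not sent
// until the previous one has finished.
async function runSequentially(first, last) {
    var results = [];
    for (var id = first; id <= last; id++) {
        var name = await fakeRequestPromise(id);
        results.push({ id: id, name: name });
    }
    return results;
}

runSequentially(1, 5).then(function (results) {
    results.forEach(function (r) {
        console.log(r.id, r.name === null ? 'null returned, do nothing' : r.name);
    });
});
```

Because each `await` finishes before the loop advances, the output follows the id order even though every simulated request takes a different amount of time.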

If you want to make all the requests in parallel (multiple requests in flight at the same time), which finishes sooner overall, and then process all the results in order at the end, you can do this:

// create promisified version of request()
function requestPromise(options) {
    return new Promise(function(resolve, reject) {
        request(options, function (err, resp, body) {
            if (err) return reject(err);
            resolve(body);
        });
    });
}

app.get('/back', function (req, res) {
    var promises = [];
    var headers = {
        'User-Agent': req.headers['user-agent'],
        'Content-Type': 'application/json; charset=utf-8'
    };
    for (var i = 1; i < 100; i++) {
        promises.push(requestPromise({url: "example.com/person/" + i, headers: headers}));
    }
    Promise.all(promises).then(function(data) {
        // iterate through all the data here
        for (var i = 0; i < data.length; i++) {
            // declare $ locally instead of creating an implicit global
            var $ = cheerio.load(data[i]);
            if ($) {
                var links = $('#container');
                var name = links.find('span[itemprop="name"]').html(); // name
                if (name == null) {
                    console.log("null returned, do nothing");
                } else {
                    name = entities.decodeHTML(name);
                    console.log(name);
                }
            } else {
                console.log("can't open");
            }
        }
    }, function(err) {
        // one of the requests failed; Promise.all rejects with the first error
        console.log(err);
    });

});
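The question's comments mention wanting about 10 requests in flight at a time rather than all 99. One way to approach that with plain promises is a small pool of "runners" that each pull the next item and store the result at that item's index. This is a sketch under assumptions, not the original answer's code: `mapWithLimit` is a hypothetical helper, and `fakeRequestPromise` is a simulated stand-in for the `requestPromise()` function above so the snippet runs without network access.

```javascript
// Hypothetical stand-in for a promisified request(): resolves after a
// random delay so that completions arrive out of order.
function fakeRequestPromise(id) {
    return new Promise(function (resolve) {
        setTimeout(function () {
            resolve('body for person ' + id);
        }, Math.floor(Math.random() * 20));
    });
}

// Run worker(item) for every item with at most `limit` calls in flight.
// Each result is stored at the index of its item, so the output order
// matches the input order regardless of which call finishes first.
function mapWithLimit(items, limit, worker) {
    var results = new Array(items.length);
    var nextIndex = 0;

    // Each runner claims the next unprocessed index until none are left.
    function runner() {
        if (nextIndex >= items.length) return Promise.resolve();
        var index = nextIndex++;
        return worker(items[index]).then(function (value) {
            results[index] = value;
            return runner();
        });
    }

    var runners = [];
    for (var i = 0; i < Math.min(limit, items.length); i++) {
        runners.push(runner());
    }
    return Promise.all(runners).then(function () {
        return results;
    });
}

var ids = [];
for (var id = 1; id < 100; id++) ids.push(id);

mapWithLimit(ids, 10, fakeRequestPromise).then(function (bodies) {
    console.log(bodies.length + ' bodies, first: ' + bodies[0]);
});
```

As with `Promise.all()` above, the first rejected request rejects the whole result; the returned array can then be iterated in order and passed to `cheerio.load()` one body at a time.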
jfriend00
  • @salep - I added a parallel version using promises. – jfriend00 Nov 20 '15 at 00:04
  • May be worthwhile to use node-fetch, since it's becoming standard, and is already promise based... – Tracker1 Nov 20 '15 at 17:44
  • @Tracker1 - Yes, that would work too. I don't know what you mean about "becoming standard". It's not built into node.js or on any standards-track in node.js that I know of unless you're talking about the fetch API in the browser. node-fetch is one of many, many libraries available for fetching data. [request-promise](https://github.com/request/request-promise) is also available as a promisified version of request. – jfriend00 Nov 20 '15 at 21:54