
I'm trying to create a Node.js web scraper. The overall operation of this scraper is:

  1. Grab an array of URLs from the database. Return in a promise.
  2. Send requests to each URL and scrape data. Return in a promise.
  3. Insert the scraped data into the database.

I want to be able to compose my steps like so.

getUrls()
  .then(scrapeData)
  .then(insertData);

However, I'm finding that in order to do this, I must wait for ALL data from every URL to resolve within step 2 (using `Promise.all`) before proceeding to the next chained step.

This could pose problems because I could be sending requests to thousands of URLs, and if one fails during `Promise.all`, all of the data gathered is then lost.

I would much rather have each function operate like so:

getUrls() //grab array of all urls (could be thousands)
  .then(scrapeData) // for each url scrape data and immediately proceed to chained function
  .then(insertData);

In short, is there a procedural way to iterate through a promise chain and control when data is waited for?

My Code:

var express = require('express');
var app = express();
var request = require('request');
var cheerio = require('cheerio');

app.get('/', (req, res) => {

    var sql = require("mssql");

    // config for your database
    var config = {
        user: '',
        password: '',
        server: '',
        database: '',
        options: {
            encrypt: false // Use this if you're on Windows Azure 
        }
    } 

    const getSkus = () => {
        var promise = new Promise((resolve, reject) => {
            sql.connect(config, (err) => {

                if (err) console.log(err);

                // create Request object
                var request = new sql.Request();

                // query to the database and get the records
                request.query('SELECT Code, MPN FROM PRODUCTS', (err, recordset) => {

                    if (err) {
                        console.log("There was an error executing the SQL statement: " + err);
                        reject(err);
                    } else {
                        resolve(recordset);
                    }
                });
            });
         });
        return promise;
    }

    const urlGen = (skus) => {
        var base_url = 'http://somesite.com/search/?q='
        var urls = [];

        skus.forEach((sku) =>{
            let code = sku.Code;
            let mpn = sku.MPN;
            let url = base_url + mpn;
            urls.push(url);
        });
        return urls;
    }

    const makeRequests = (urls) => {
        var promises = [];

        urls.forEach((url) => {
            var promise = new Promise((resolve, reject) => {
                request(url, (err, response, html) => {
                    if (!err && response.statusCode == 200) {
                        try {
                            //do scraping here
                            resolve(jsontemp);
                        } catch (err) {
                            console.log('Error occurred during data scraping:');
                            reject(err);
                        }
                    } else {
                        reject(err);
                    }
                });
            });
            promises.push(promise);
        });

        return Promise.all(promises);
    }

    getSkus()
        .then(urlGen)
        .then(makeRequests)
        .catch((e) => console.log(e));

});

var server = app.listen(5000, function () {
    console.log('Server is running..');
});
Query
  • Please show us your code so we can advise you specifically on how to solve your exact problem rather than try to answer generically. Questions about your code should include the relevant code in the question. We could probably show you exactly how to change your code. – jfriend00 Dec 14 '16 at 18:32
  • If you want to process each URL as it comes in, then you can't have the first step be `getUrls()` that gets all URLs. You seem to already know that, so now I have no idea what you're actually asking for help with. Redesign the process to get an URL and process it immediately. – jfriend00 Dec 14 '16 at 18:34
  • Code has been included. I didn't want to include code so that it would be a general question as stackoverflow requires questions not to be so user specific. – Query Dec 14 '16 at 18:56
  • This doesn't sound like good practice, but you could wrap each promise inside another promise and resolve both in `.then` and `.catch`, that way all promises will always resolve – martskins Dec 14 '16 at 18:56
  • You have it wrong if you think stackoverflow doesn't want your specific user code. That's the only way we can truly see your exact problem and offer a correct and specific solution. General tutorials on general, non-specific problems are for other web-sites, not for here. – jfriend00 Dec 14 '16 at 19:05
  • Do you require that the URLs are processed in order or `insertData()` is called in order or can you just run a bunch of URLs in process and let them complete whenever they do? – jfriend00 Dec 14 '16 at 19:18
  • Order does not matter. – Query Dec 14 '16 at 19:22

2 Answers


As you seem to already know, your proposed scheme:

getUrls()
  .then(scrapeData)
  .then(insertData);

is designed to run each step of the process on the entire array before continuing to the next step. But, it sounds like you want to process each URL as soon as you can rather than waiting for all the URLs to finish each step.

However, I'm finding that in order to do this, I must wait for ALL data from every URL to resolve within step 2 (using `Promise.all`) before proceeding to the next chained step.

Yes, that's what the code above is designed to do.

This could pose problems because I could be sending requests to thousands of URLs, and if one fails during `Promise.all`, all of the data gathered is then lost.

If you want promise chained iterations to continue even if there are errors, then you have to catch errors locally in the iteration so they are not automatically propagated upwards which would stop the promise chain. This allows the whole iteration to complete even if some errors occur. Or, you can use a substitute for Promise.all() which is often called settle() which waits for all promises to be settled (whether rejected or resolved) and then returns you the results of everything. You can see how an implementation of settle works here, though the concept is similar to what my code below shows.
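A settle-style helper takes only a few lines with plain ES6 promises. This is a sketch; the result shape used here (`{state, value}` / `{state, reason}`) is just a convention, not a standard API:

```javascript
// Wait for every promise to settle, recording each outcome instead
// of rejecting on the first failure (similar in spirit to the
// built-in Promise.allSettled() in newer runtimes).
function settle(promises) {
    return Promise.all(promises.map(function (p) {
        return Promise.resolve(p).then(
            function (value) { return { state: 'fulfilled', value: value }; },
            function (reason) { return { state: 'rejected', reason: reason }; }
        );
    }));
}

// One rejection no longer discards the other results:
settle([Promise.resolve(1), Promise.reject(new Error('boom')), Promise.resolve(3)])
    .then(function (results) {
        console.log(results.map(function (r) { return r.state; }));
        // → [ 'fulfilled', 'rejected', 'fulfilled' ]
    });
```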


If I understand your code correctly, getSkus() is an async function that returns a list of skus from a database and getURLs() is a synchronous function that just processes the skus into URLs. So, each of those is a single operation that can't really be broken up into pieces, and we'll start with that as the beginning of the operation.

As such, you could do something like this:

const Promise = require('bluebird');
const request = Promise.promisifyAll(require('request'), {multiArgs: true});

Promise.map(getSkus().then(getURLs), function(url) {
    // This will only ever return a promise that resolves (all rejections are caught locally)
    // so that Promise.map() will not stop when an error occurs, but will
    // process all URLs
    return request.getAsync(url).then(scrapeData).then(insertData).catch(function(err) {
        // log the error, but consciously let the promise iteration continue (without err)
        console.error(err);
        // put error in the results in case caller wants to see all errors
        return err;
    });
}, {concurrency: 10}).then(function(results) {
    // results will be an array of whatever insertData returns
    // or, for any step in the iteration that had an error, it will be
    // some type of Error object
});

In this implementation, scrapeData() and insertData() are adapted to handle the arguments they are passed here.

This uses Bluebird to promisify the request() module and to iterate your array of URLs with some concurrency control (to prevent launching a ton of simultaneous requests), though you could use standard ES6 promises by just writing more code yourself to control concurrency and to promisify the request() module.


Using only ES6 standard promises (and without any concurrency control to limit how many requests are in-flight at once), you can do this:

const request = require('request');

// manually promisify request.get()
function requestAsync(url) {
    return new Promise(function(resolve, reject) {
        request.get(url, function(err, response, body) {
            if (err) {
                reject(err);
            } else {
                resolve(body);
            }
        });
    });
}

getSkus().then(getURLs).then(function(urls) {
    return Promise.all(urls.map(function(url) {
        return requestAsync(url).then(scrapeData).then(insertData).catch(function(err) {
            // log the error, but consciously let the promise iteration continue (without err)
            console.error(err);
            // put error in the results in case caller wants to see all errors
            return err;
        });
    }));
}).then(function(results) {
    // results will be an array of whatever insertData returns
    // or, for any step in the iteration that had an error, it will be
    // some type of Error object
});

Implementing your own concurrency control is a bit more code (which is why I used Bluebird in my first example since it has that built in).
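For illustration, here is one way to hand-roll that concurrency control with standard ES6 promises; `mapWithConcurrency` is a hypothetical helper sketched for this answer, not a library function:

```javascript
// Run task(item) for each item with at most `limit` tasks in flight
// at once; resolves with results in the same order as `items`.
function mapWithConcurrency(items, limit, task) {
    var results = new Array(items.length);
    var next = 0;

    // Each worker pulls the next unprocessed item when it finishes one.
    function worker() {
        if (next >= items.length) return Promise.resolve();
        var i = next++;
        return Promise.resolve(task(items[i], i)).then(function (value) {
            results[i] = value;
            return worker();
        });
    }

    // Start up to `limit` workers; each keeps going until the list is drained.
    var workers = [];
    for (var w = 0; w < Math.min(limit, items.length); w++) {
        workers.push(worker());
    }
    return Promise.all(workers).then(function () { return results; });
}
```

You would then replace the plain `urls.map(...)` above with something like `mapWithConcurrency(urls, 10, function(url) { return requestAsync(url).then(scrapeData).then(insertData).catch(...); })`.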

jfriend00

The secret is to execute the next `request` call (assuming it makes an HTTP request) only after the previous one has been executed.

Using the native ES6 Promise implementation

One possible implementation would be:

function getUrls () {
  var urls = ['http://google.com', 'http://amazon.com', 'http://microsoft.com']
  var bodies = []

  // Initial value
  var promise = Promise.resolve()

  urls.forEach(function (url) {
    // We assign a new promise that will resolve only after the previous one has finished
    promise = promise.then(function () {
      return request(url)
    }).then(function (body) {
      bodies.push(body)
    })
  })

  // Then we return a promise that is the result of all fetched urls
  return promise.then(function () {
    return bodies
  })
}
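The same sequential chaining is often written with `reduce`, which avoids reassigning the `promise` variable. In this sketch, `request` is stubbed with a promise-returning function, since the real one depends on which HTTP library you promisify:

```javascript
// Stand-in for a promisified request(); resolves with a fake body.
function request (url) {
  return Promise.resolve('body of ' + url)
}

function getUrls () {
  var urls = ['http://google.com', 'http://amazon.com', 'http://microsoft.com']

  // Each step chains onto the accumulated promise, so the requests
  // run strictly one after another and bodies stay in order.
  return urls.reduce(function (chain, url) {
    return chain.then(function (bodies) {
      return request(url).then(function (body) {
        bodies.push(body)
        return bodies
      })
    })
  }, Promise.resolve([]))
}

getUrls().then(function (bodies) {
  console.log(bodies.length) // → 3
})
```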

However, I'd recommend you use the bluebird module, which has some really convenient methods for working with collections of promises.

Using the bluebird Promise implementation

That would actually just be:

var Promise = require('bluebird')
function getUrls () {
  var urls = ['http://google.com', 'http://amazon.com', 'http://microsoft.com']
  return Promise.resolve(urls).map(function (url) {
    return request(url)
  })
}

If you actually want the operation to be retriable, you can make some modifications.

var Promise = require('bluebird')

function getUrls () {
  var urls = ['http://google.com', 'http://amazon.com', 'http://microsoft.com']
  return Promise.resolve(urls).map(function (url) {
    return retriableRequest(url, 3)
  })
}

function retriableRequest (url, retries) {
  return request(url).catch(function (error) {
    if (retries <= 0) throw error

    return retriableRequest(url, retries - 1)
  })
}
thalesmello
  • I don't think this solves the problem. If one of the promises chained to the original `promise` var fails then the whole chain fails. – martskins Dec 14 '16 at 18:55
  • What exactly do you want to do in a situation of error? That's somewhat easy to accomplish with promises. You can, for example, retry requests. Take a look at the example I edited in the answer above. – thalesmello Dec 14 '16 at 22:10