0

I am currently trying to write a function that takes in user input and returns the top subreddit that the user's comment might have belonged to based on how many times the word occurred in training data. I have a database containing frequencies of words in different subreddits, which I am making GET requests to for each subreddit and for every word in the user's input.

This adds up to many get requests, as I have over 7000 subreddits in my database. I am currently making a request-promise (shallow) request to get the list of all subreddits. Then, for each subreddit, I loop over each word in the user's input, make another request-promise object and add that to an array of promises.

Once all the request-promise objects are added, I wait until they are all resolved using Promise.all, where I then try to print out the array of word frequencies for the given subreddit, but I get an 'Error: connect EMFILE' message.

According to another post on stack overflow, this means that I have too many sockets open, but I'm confused as to how that could happen. From my understanding, wouldn't it only ever have up to user_words.length possible connections open at a time, since those are the requests that are being done while the Promise.all waits to resolve? I don't see how the connections are not being closed.

Thanks in advance for any help!

function getBestSubreddit(messageText) {
  var user_words = parse_message(messageText);
  var top_subreddit = "";
  var top_score = Number.MIN_SAFE_INTEGER;
  rp(dbUrl + '/.json?shallow=true').then(function(res) {
    res = JSON.parse(res);
    for (var subreddit in res) {
      if (res.hasOwnProperty(subreddit)) {
        var score = 0.0;
        var promises = []
        for (var i = 0; i < user_words.length; i++) {
          promises.push(rp(dbUrl + '/' + subreddit + '/word_freqs/' + user_words[i] + '.json'));
        }
        Promise.all(promises).then(function(values) {
          console.log(values);
        }, function(err) {
          console.log(err);
        });
      }
    }
  }).catch(function(err) {
    console.log(err);
  })
  return top_subreddit;
}
  • Unlike your title proposes, the problem isn't caused by `request-promise`. The problem is caused by your two nested loops trying to do tons of requests in parallel rather than having your code controll how many are in-flight at the same time. – jfriend00 Apr 01 '17 at 15:26
  • Some related answers: [How to make millions of parallel http requests from nodejs app?](http://stackoverflow.com/questions/38268371/how-to-make-millions-of-parallel-http-requests-from-nodejs-app/38272107#38272107) and [In Node js. How many simultaneous requests can I send with the “request” package](http://stackoverflow.com/questions/36611890/in-node-js-how-many-simultaneous-requests-can-i-send-with-the-request-package/36612175#36612175) and [Making a million requests](http://stackoverflow.com/questions/34802539/node-js-socket-explanation/34802932#34802932). – jfriend00 Apr 01 '17 at 15:29
  • Personally, I'd probably use the concurrency option in Bluebird's `Promise.map()` to manage how many requests were in-flight at the same time. – jfriend00 Apr 01 '17 at 15:32

2 Answers2

0

From my understanding, wouldn't it only ever have up to user_words.length possible connections open at a time, since those are the requests that are being done while the Promise.all waits to resolve? I don't see how the connections are not being closed.

No, this is not correct. You have two nested for loops so you can have as many as user_words.length * how many subreddits there are all open at the same time. Remember, rp() and Promise.all() do not block so you run the nested for loops to completion launching every single connection before any of the responses are processed.

It also looks like you're expecting to somehow synchronously return a result with the line of code return top_subreddit. You can't do that either. You should return a promise that will eventually resolve to the desired result.

From my understanding, wouldn't it only ever have up to user_words.length possible connections open at a time, since those are the requests that are being done while the Promise.all waits to resolve? I don't see how the connections are not being closed.

This is not the right understanding of Promise.all(). Promise.all() is not blocking. It doesn't "wait" until all the promises are resolved before your code continues to exit. It behaves asynchronously. Your code continues to execute other iterations of your for loop and Promise.all() calls it's .then() handler sometime in the future when all the promises you passed it have finished. The other iterations of your for loop continue to run and pile up more sockets.

I think the easiest way to approach this is the make an array of URLs that you want to process and then use one of the asynchronous libraries that already has a function built-in to allow you to run a max of N async operations in-flight at the same time. Since your code is promise-based, I would choose Bluebird's Promise.map() to process the list of URLs.

var Promise = require('bluebird');

function getBestSubreddit(messageText) {
  var user_words = parse_message(messageText);
  var top_subreddit = "";
  var top_score = Number.MIN_SAFE_INTEGER;
  return rp(dbUrl + '/.json?shallow=true').then(function(res) {
    res = JSON.parse(res);
    // build a list of URLs to process
    var urls = [];
    for (var subreddit in res) {
      if (res.hasOwnProperty(subreddit)) {
        for (var i = 0; i < user_words.length; i++) {
          urls.push(dbUrl + '/' + subreddit + '/word_freqs/' + user_words[i] + '.json');
        }
      }
    }
    // 
    return Promise.map(urls, function(url) {
        return rp(url);
    }, {concurrency: 20}).then(function(allResults) {
        // do any final processing of allResults here and return that value
        // to become the resolved result of the returned promise
    });
  }
}

getBestSubreddit(someText).then(function(result) {
    // process result here
}).catch(function(err) {
    // handle error here
});

I set the number of concurrent requests to 20 in this example. You can experiment whether changing that to a higher or lower number improves your throughput. The ideal number is dependent upon a number of things including your local execution environment, the amount of data you are requesting and the bandwidth you have and on the target host you are making requests from and how it handles simultaneous requests. You may also need to be concerned about rate limiting by the target if you make too many requests too quickly.

Some other related answers:

How to make millions of parallel http requests from nodejs app?

In Node js. How many simultaneous requests can I send with the "request" package

Making a million requests


It still is not clear to me from your question exactly what result you're trying to get, but here's a version that collects all the data possible. You end up with an array of objects of this form {result: result, subreddit: subreddit, word: word} where result is the result of your rp() for a given subreddit and a given word. You can then collate that set of results however you want:

var Promise = require('bluebird');

function getBestSubreddit(messageText) {
  var user_words = parse_message(messageText);
  var top_subreddit = "";
  var top_score = Number.MIN_SAFE_INTEGER;
  return rp(dbUrl + '/.json?shallow=true').then(function(res) {
    res = JSON.parse(res);
    // build a list of URLs to process
    var requestData = [];
    for (var subreddit in res) {
      if (res.hasOwnProperty(subreddit)) {
        for (var i = 0; i < user_words.length; i++) {
          requestData.push({url:dbUrl + '/' + subreddit + '/word_freqs/' + user_words[i] + '.json', subreddit: subreddit, word: user_words[i]});
        }
      }
    }
    // 
    return Promise.map(requestData, function(url) {
        return rp(requestData.url).then(function(result) {
            return {result: result, subreddit: requestData.subreddit, word: requestData.word};
        });
    }, {concurrency: 20}).then(function(allResults) {
        // now filter through all the data with appropriate subreddit
        // allResults is an array of objects of this form {result: result, subreddit: subreddit, word: word}
        // return whatever you want the final result to be after processing the allResults array
    });
  }
}

getBestSubreddit(someText).then(function(result) {
    // process result here
}).catch(function(err) {
    // handle error here
});
Community
  • 1
  • 1
jfriend00
  • 683,504
  • 96
  • 985
  • 979
  • Thank you! I think the code you provided returns a list of every frequency across every subreddit, since it adds every possible URL to the urls array and then uses Promise.map() to limit the concurrency to at most 20 at a time. Is there any way to get a the list of word frequencies per subreddit instead? I want to calculate a score for each subreddit and then use that score to determine the top subreddit. For this, would I do a separate Promise.map() for each individual subreddit? – cheese123211 Apr 01 '17 at 23:00
  • @cheese123211 - Your code does not show what you want to do to process the results so I can't include that type of code nor even know what you're trying to do with the results. I showed you how to get all the results and how to control how many requests are in-flight at a time. You should be able to handle your own processing of the results. – jfriend00 Apr 02 '17 at 00:28
  • @cheese123211 - Yes, you could accumulate results per sub-reddit. The code in your question does not show what you are trying to do with the results so I didn't have anything to go by to show code related to that. – jfriend00 Apr 02 '17 at 00:30
  • @cheese123211 - Since I'm unable to understand from your question exactly what result you are trying to get, I've added a new version that returns an array of objects that has `{word, subreddit, rpResult}` in it for every request and you should be able to process that array of objects to calculate any result you are after. – jfriend00 Apr 02 '17 at 00:54
0

The problem arises from two nested loops and unthrottled rp() calls, leading to many simultaneous requests.

Throttling is typically achieved by :

  • serialising by building a then() chain, eg by reducing an array.
  • enforcing a "concurrency" limit, using eg Bluebird's Promise.map() with its concurrency option.

I guess there must be a number of approaches for this particular problem, but essentially :

  • pool all requests and throttle by concurrency (jFriend00's answer),
  • allow one loop to remain synchronous and throttle the other with serialisation or concurrency,
  • nest a serialisation in a serialisation,
  • nest a concurrency in a concurrency,
  • adopt a mixed approach of serialisation and concurrency.

Here's a mixed approach, in which :

  • the original outer loop is throttled by serialisation
  • the original inner loop is throttled by Bluebird's map with concurrency.
function getSubreddits(messageText) {
    var user_words = parse_message(messageText);
    return rp(dbUrl + '/.json?shallow=true').then(function(res) {
        var subreddits = Object.keys(JSON.parse(res));
        return subreddits.reduce(function(p, subreddit) {
            return p.then(function() {
                return Promise.map(user_words, function(word) {
                    return rp(dbUrl + '/' + subreddit + '/word_freqs/' + word + '.json');
                }, {concurrency: 10}).then(function(freqs) {
                    // return an object that associates each subreddit with its results
                    return {
                        'subreddit': subreddit, // or maybe the object for which `subreddit` is the key?
                        'word_freqs': freqs
                    };
                });
            });
        }, Promise.resolve());
    });
}

A downside is that you end up with a deeply nested eyesore, which doesn't lend itself to flattening. That said, most if not all of the other approaches would be similar.

Whichever approach you adopt, getBestSubreddit() will now comprise a call to getSubreddits() plus some post-processing of the results.

function getBestSubreddit(messageText) {
    return getSubreddits(messageText).then(function(results) {
        // Here `results` is an array of `{'subreddit', 'word_freqs'}` objects.
        // Loop through and calculate a score for each subreddit,
        // then use that score to determine the top subreddit,
        // and return it.
    }).catch(function(error) {
        console.log(error);
    });
}
Graham
  • 7,431
  • 18
  • 59
  • 84
Roamer-1888
  • 19,138
  • 5
  • 33
  • 44