I have a script to scrape ~1000 webpages. I'm using Promise.all to fire all the requests at once, and it resolves when every page is done:
Promise.all(urls.map(url => scrap(url)))
  .then(results => console.log('all done!', results));
This is sweet and correct, except for one thing: the machine runs out of memory because of the concurrent requests. I'm using jsdom for the scraping, and it quickly eats up a few GB of memory, which is understandable considering it instantiates hundreds of window objects.
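For context, scrap() is roughly this shape (a simplified sketch, assuming the JSDOM.fromURL flavour of the jsdom API; the real function extracts more than the title):

const { JSDOM } = require('jsdom');

// Load the page in jsdom, pull something out of the DOM, return it.
function scrap(url) {
  return JSDOM.fromURL(url).then(dom => {
    const title = dom.window.document.title;
    dom.window.close(); // release the window once this page is done
    return { url, title };
  });
}

Even with window.close(), hundreds of windows are alive at the same time while the requests are in flight.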
I have an idea for a fix, but I don't like it: change the control flow to not use Promise.all and instead chain my promises:
let results = {};

urls.reduce((prev, cur) =>
  prev
    .then(() => scrap(cur))
    .then(result => results[cur] = result),
    // ^ not so nice.
  Promise.resolve()
).then(() => console.log('all done!', results));
This is not as good as Promise.all... it's not performant because the requests run one after another, and the returned values have to be stashed in an outer object for later processing.
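(An untested variant that threads the results through the reduce accumulator instead of the outer object, though it is still strictly sequential:)

urls.reduce(
  (prev, url) => prev.then(acc =>
    scrap(url).then(result => acc.concat([result]))),
  Promise.resolve([])
).then(results => console.log('all done!', results));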
Any suggestions? Should I improve the control flow, reduce memory usage in scrap(), or is there a way to let Node throttle memory allocation?
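To clarify what I mean by "improve the control flow": something in the direction of this rough, untested batching sketch (batchSize is just a number I would have to tune), which caps how many jsdom windows exist at once but still stalls on the slowest url of each batch:

// Split urls into fixed-size batches and Promise.all each batch in turn,
// so at most `batchSize` pages are loaded in jsdom at the same time.
function scrapInBatches(urls, batchSize) {
  const batches = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    batches.push(urls.slice(i, i + batchSize));
  }
  return batches.reduce(
    (prev, batch) => prev.then(acc =>
      Promise.all(batch.map(url => scrap(url)))
        .then(batchResults => acc.concat(batchResults))),
    Promise.resolve([])
  );
}

scrapInBatches(urls, 20).then(results => console.log('all done!', results));

But that still feels like a workaround, so better ideas are welcome.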