
I'm developing a Node.js scraping API with Puppeteer that scrapes over 1,000 links daily, and I want it to run automatically.

I'm looking for a task queue that waits for one link-scraping function to finish before moving on to the next one.

I've found Bull, which uses Redis. I'm already using MongoDB as my database and don't want to run another one, so Bull isn't suitable for me.

Any suggestions? I'd really appreciate any help.
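
To make the requirement concrete, the simplest version of what I mean is a plain async loop that awaits each scrape before starting the next (`scrapeLink` below stands in for my Puppeteer function):

```javascript
// Minimal sketch: process links strictly one at a time.
// `scrapeLink` is a placeholder for the actual Puppeteer scraping function.
async function scrapeSequentially(links, scrapeLink) {
  const results = [];
  for (const link of links) {
    // Each scrape finishes before the next one starts.
    results.push(await scrapeLink(link));
  }
  return results;
}
```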

  • Assuming you have some array/list of all the URLs you want to scrape, this is pretty easy with a short amount of code. Here are a couple of implementations in prior answers: [Too many requests consumes all my RAM](https://stackoverflow.com/questions/46654265/promise-all-consumes-all-my-ram/46654592#46654592), [How to control how many promises access network in parallel](https://stackoverflow.com/questions/41028790/javascript-how-to-control-how-many-promises-access-network-in-parallel/41028877#41028877). – jfriend00 Jan 25 '20 at 03:27
  • And, some more: [Loop through an API with variable URL](https://stackoverflow.com/questions/48842555/loop-through-an-api-get-request-with-variable-url/48844820#48844820), [Choose proper async method for batch processing for max requests per second](https://stackoverflow.com/questions/36730745/choose-proper-async-method-for-batch-processing-for-max-requests-sec/36736593#36736593), [Nodejs async request with a list of URLs](https://stackoverflow.com/questions/47299174/nodejs-async-request-with-a-list-of-url/47299802#47299802). – jfriend00 Jan 25 '20 at 03:33
  • These answers contain four separate functions (each varying slightly in how you control them), but for less than 30 lines of code, you can just copy one into your project and use it. Likewise, the Bluebird and Async libraries both contain multiple functions for managing how many requests are in flight at the same time, if you want to grab a library solution. – jfriend00 Jan 25 '20 at 03:35
  • FYI, you will typically get the best performance in node.js if you run N requests in parallel, not just 1. You will have to experiment with values for N, because it depends on how much CPU-intensive work you do while processing the results and how much memory each request takes to do its job. A good starting point is often something like 5 in flight at once; then test at 5, at 10, and at 3 and see which direction gives you better performance with acceptable memory usage (a sketch of this pattern follows below). – jfriend00 Jan 25 '20 at 03:37
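
Building on those comments, here is a minimal sketch of the "N requests in flight at once" pattern. The function name `mapConcurrent` and the `scrapeLink` callback are illustrative, not from any particular library:

```javascript
// Run at most `limit` scrapes in parallel; each worker pulls the next URL
// from a shared index as soon as it finishes its current one.
async function mapConcurrent(items, limit, fn) {
  const results = new Array(items.length);
  let nextIndex = 0;

  async function worker() {
    while (nextIndex < items.length) {
      const i = nextIndex++;            // claim the next item (single-threaded, so no race)
      results[i] = await fn(items[i]);  // scrape it, then loop for more
    }
  }

  // Start `limit` workers (or fewer if there aren't that many items).
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker)
  );
  return results;
}

// Usage: experiment with the limit (e.g. 3, 5, 10) as suggested above.
// mapConcurrent(urls, 5, url => scrapeLink(url)).then(console.log);
```

With `limit` set to 1 this degrades to the strictly sequential queue from the question; raising it lets several scrapes run at once.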
