I am implementing a web crawler in Node, with MongoDB as the back-end for storing pages and their status. The crawler should be able to run on multiple machines, and in addition each machine will run multiple workers in parallel in order to speed up crawling of the pending pages.
Each worker will (roughly sketched in code after the list):
- Query the database for some number of pages that are still pending to be crawled
- Update their status from "Pending" to "In Progress"
- Crawl them
- Update their status from "In Progress" to "Finished"
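To make that concrete, here is a rough sketch of the per-worker loop, assuming the official `mongodb` Node driver; the connection string, database/collection names and `crawlPage()` are placeholders, and step 1 is exactly where two workers can collide:

```js
const { MongoClient } = require('mongodb');

// Placeholder for the actual crawling logic.
async function crawlPage(uri) { /* fetch and parse the page */ }

async function runWorker(workerId, batchSize) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const pages = client.db('crawler').collection('pages');

  // 1. Query for some pending pages -- nothing here stops another worker
  //    from getting the very same batch.
  const batch = await pages.find({ status: 'Pending' }).limit(batchSize).toArray();

  for (const page of batch) {
    // 2. Mark the page as "In Progress" for this worker.
    await pages.updateOne({ _id: page._id }, { $set: { status: 'In Progress', workerId } });
    // 3. Crawl it.
    await crawlPage(page.uri);
    // 4. Mark it as "Finished".
    await pages.updateOne({ _id: page._id }, { $set: { status: 'Finished' } });
  }

  await client.close();
}
```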
Having this in mind, I am trying to find a way to ensure that multiple workers do NOT query for the same pages at the same time.
Every worker has its own unique ID, so pages are just documents with a structure like:
{ uri, status, workerId, <other data> }
My plan was to mark N documents with the current worker's ID (indicating that they will be processed by this worker) and then query for them.
Something like: set workerId to <currentWorkerId>
for documents which match: { "status": "Pending", "workerId": null }
And then query for documents which match: { "status": "Pending", "workerId": "<currentWorkerId>" }
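In driver terms the plan would look roughly like this (the `pages` collection handle is the same as in the sketch above, and `claimBatch` is just an illustrative name):

```js
// Intended two-step claim. The catch: updateMany has no limit option,
// so step 1 would stamp EVERY matching page, not just n of them.
async function claimBatch(pages, currentWorkerId, n) {
  // Step 1: stamp pending, unclaimed pages with this worker's ID.
  await pages.updateMany(
    { status: 'Pending', workerId: null },
    { $set: { workerId: currentWorkerId } }
  );

  // Step 2: read back the pages this worker now owns.
  return pages
    .find({ status: 'Pending', workerId: currentWorkerId })
    .limit(n)
    .toArray();
}
```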
The problem is that, as far as I can see, MongoDB does not support updates with a limit. Of course I can execute N update operations, each updating a single document, but I wonder if there is a more idiomatic/elegant solution for this kind of task?
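For reference, this is what I mean by N single-document updates, sketched with `findOneAndUpdate` (result shape assumed to be the driver v4/v5 `{ value }` wrapper); each call atomically claims one document, so workers cannot collide, but it costs N round trips per batch:

```js
// Claim pages one at a time: findOneAndUpdate atomically matches and stamps
// a single document, so two workers can never claim the same page.
async function claimBatchOneByOne(pages, currentWorkerId, n) {
  const claimed = [];
  for (let i = 0; i < n; i++) {
    const result = await pages.findOneAndUpdate(
      { status: 'Pending', workerId: null },
      { $set: { workerId: currentWorkerId } },
      { returnDocument: 'after' } // return the claimed document itself
    );
    if (!result.value) break;     // nothing left to claim
    claimed.push(result.value);
  }
  return claimed;
}
```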
In the end, my objective is to make sure that whenever two or more workers query for pages to process, they never retrieve the same page twice.