
I am implementing a web crawler written in Node, using MongoDB as the back-end to store pages and their status. The crawler should be able to run on multiple machines, and in addition each machine will have multiple workers running in parallel in order to speed up crawling of the pending pages.

Each worker will:

  1. Query the database for a batch of pages which are still pending to be crawled
  2. Update their status from "Pending" to "In Progress"
  3. Crawl them
  4. Update their status from "In Progress" to "Finished"

With this in mind, I am trying to find a way for multiple workers to NOT query for the same pages at the same time.

Every worker has its own unique ID, so pages are just documents with a structure like:

{ uri, status, workerId, <other data> }

My plan was to mark N documents with the current worker ID (indicating that they will be processed by this worker) and then query for them.

Something like: set workerId to <currentWorkerId> for documents which have: { "status": "Pending", "workerId": null }

And then query documents which have: { "status": "Pending", "workerId": "<currentWorkerId>" }

The problem is that, as far as I can see, MongoDB does not support updates with a limit. Of course I can execute N update operations, each updating a single document, but I wonder if there is a more idiomatic/elegant solution for this kind of task?
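For illustration, that per-document fallback could look roughly like this (mongo shell syntax; the batch size and the <currentWorkerId> value are just placeholders):

var BATCH_SIZE = 100 // placeholder batch size
for (var i = 0; i < BATCH_SIZE; i++) {
    // atomically claim one pending, unassigned page for this worker
    var page = db.pages.findOneAndUpdate(
        { status: "Pending", workerId: null },
        { $set: { workerId: "<currentWorkerId>" } }
    )
    if (page === null) break // nothing left to claim
}
// afterwards, query as described above:
// db.pages.find({ status: "Pending", workerId: "<currentWorkerId>" })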

In the end my objective is to make sure that whenever 2 or more workers query for pages to process, they never retrieve the same page.

Dmitry Papka
  • What version of MongoDB are you running, and are you sharded? – barrypicker Oct 23 '19 at 19:08
  • Isn't this just a question about updating multiple docs with a query? https://stackoverflow.com/questions/1740023/mongodb-how-to-update-multiple-documents-with-a-single-command – Robert Moskal Oct 23 '19 at 20:25
  • @RobertMoskal - Yes, but I think more specifically, how to structure the find portion of an update statement so the selection of records is pseudo-random for even distribution. – barrypicker Oct 23 '19 at 20:29

2 Answers


Well, I think I understand the objective: you wish to update all documents having a pending state and assign a worker to them, distributing the work somewhat evenly. Once the worker assignments are done, each worker will identify its pages to scan. But you don't like the idea of walking a cursor one document at a time, and would prefer to update a set of data at a time.

Here is an example using a $where condition in an updateMany() call. Please keep in mind that $where cannot use indexes; if you have an index on 'status' you may be OK, but this might not work well from a performance perspective. My belief is that you wish to update all pending records, so the performance impact may still be better this way compared to updating one record at a time. Also, my query predicates do not check whether workerId is null, because I believe there should never be a case where status is 'Pending' and workerId is not null.

Assuming two workers, my idea uses two update statements, one for worker0 and another for worker1. I assume your documents have a field called _id which is an ObjectId. The strategy is to use the _id field's timestamp and look at its seconds value: documents whose seconds value is between 0 and 29 are assigned to worker0, all others to worker1. If you have more workers, this strategy would need to be altered to accommodate the desired number of workers.

worker0 Assignment:

db.pages.updateMany({ "status": "Pending", $where: function() {
        // assign documents whose _id timestamp falls in the first half of the minute
        var seconds = this._id.getTimestamp().getSeconds();
        return seconds >= 0 && seconds < 30;
    }
}, { $set: { status: "In Progress", workerId: 0 } })

worker1 Assignment:

db.pages.updateMany({ "status": "Pending", $where: function() {
        // assign documents whose _id timestamp falls in the second half of the minute
        var seconds = this._id.getTimestamp().getSeconds();
        return seconds >= 30;
    }
}, { $set: { status: "In Progress", workerId: 1 } })
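If you have more than two workers, one possible generalization (a rough sketch, assuming MongoDB 4.2+ so that pipeline-style updates are available; N is the assumed number of workers) is to bucket the seconds of the _id timestamp modulo the number of workers:

var N = 4 // assumed number of workers
db.pages.updateMany(
    { status: "Pending" },
    [ { $set: {
        status: "In Progress",
        // derive the creation time from _id and bucket its seconds modulo N
        workerId: { $mod: [ { $second: { $toDate: "$_id" } }, N ] }
    } } ]
)

This keeps the whole assignment in a single statement and avoids $where entirely, at the cost of requiring a newer server version.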

Once these queries are run the assignments are complete. Each worker can now identify which pages to crawl by issuing their own respective query. For example:

Worker0 identifies pages to crawl:

db.pages.find({status: "In Progress", workerId: 0})

Worker0 completed:

Once the worker crawls the page it can mark the record as finished to prevent it from being crawled again on future runs.

db.pages.updateOne({_id: ObjectId("5db0b1953cf0c979dd020fa2")}, { $set: {status: "Finished"}})

Conclusion:

I am curious about your thoughts on this approach and appreciate any feedback, good or bad. Flame on!

Afterthoughts

A completely different approach could be to assign the worker randomly when the records are initially inserted. This doesn't help records already created with a null assignment, however.
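A minimal sketch of that insert-time idea (the URI and worker count are placeholders):

var NUM_WORKERS = 2 // assumed number of workers
db.pages.insertOne({
    uri: "https://example.com/some-page", // placeholder URI
    status: "Pending",
    workerId: Math.floor(Math.random() * NUM_WORKERS) // random, roughly even assignment
})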

barrypicker

Without creating a separate dispatcher process to assign the work, perhaps a three-stage approach:

  1. Query pending documents with a limit, retrieving only the _id field. If you have an index on {status: 1, workerId: 1, _id: 1} this query can be covered, which helps performance
  2. Update using the $in operator to set the status to In Progress and assign the worker ID
  3. Query for In Progress and worker ID

Something like:

// 1. Grab a batch of unclaimed page ids (covered by the index above)
var ids = db.pages.find({status: "Pending", workerId: null}, {_id: 1}).limit(100).toArray().map(p => p._id)

// 2. Claim those pages for this worker (MyID is this worker's unique ID)
db.pages.updateMany({_id: {$in: ids}}, {$set: {status: "In Progress", workerId: MyID}})

// 3. Retrieve the claimed pages to crawl
var workcursor = db.pages.find({status: "In Progress", workerId: MyID})

If you have multiple workers coming in at the same time, there is a possibility of a race where two of them try to claim the same page. You could execute the above steps in a transaction to avoid that situation.
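A rough sketch of what that could look like in the shell (assuming a replica set, since multi-document transactions require one; the database name and MyID are placeholders):

var session = db.getMongo().startSession()
var pages = session.getDatabase("crawler").pages // placeholder database name
session.startTransaction()
try {
    // steps 1 and 2 run inside a single transaction
    var ids = pages.find({status: "Pending", workerId: null}, {_id: 1}).limit(100).toArray().map(p => p._id)
    pages.updateMany({_id: {$in: ids}}, {$set: {status: "In Progress", workerId: MyID}})
    session.commitTransaction()
} catch (e) {
    session.abortTransaction()
    throw e
}

If two workers do race for the same documents, one transaction should fail with a write conflict and can simply be retried.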

Joe