4

My scenario is that I have a collection consisting of many documents to be processed, one document at a time. Processing a single document takes a relatively long time, and processing the whole collection will take many hours. Therefore I will have multiple simultaneous 'workers' processing the same collection. Each worker needs to do something like:

(A) get the next unprocessed document,

(B) process it,

(C) mark the document as processed, and continue.

How do I ensure that the simultaneous processes do not read the same documents? I do not know what the key values will be, so I can't say something like process_A should start at 1 and process_B at a million. I would also like to add as many processes as are manageable, so it is not practical to have one go forwards and another go backwards.
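For concreteness, here is a rough sketch of the loop each worker would run (Python with pymongo, and made-up database, collection, and field names); the question is how to close the window between steps (A) and (C):

```python
from pymongo import MongoClient

def process(doc):
    ...  # placeholder for the slow per-document work

# Made-up names, for illustration only.
docs = MongoClient()["mydb"]["documents"]

while True:
    # (A) get the next unprocessed document
    doc = docs.find_one({"processed": {"$ne": True}})
    if doc is None:
        break
    # (B) process it (this is the slow part)
    process(doc)
    # (C) mark the document as processed
    docs.update_one({"_id": doc["_id"]}, {"$set": {"processed": True}})
    # Problem: between (A) and (C) another worker can pick up the same document.
```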

I ask about MongoDB because that is what I am using. I imagine the same question could be asked about a SQL database.

I implore anyone who wants to help not to focus on changing the scenario, which, for whatever external reasons, is a given.

Thank you

sdfor
  • Can you use skip and limit to partition the collection to your liking and assign the workers to these partitions? – TeTeT Nov 26 '15 at 00:59
  • @TeTeT Skip probably wouldn't be a great option for this, because internally skip still needs to process all the docs it skips. It would be very inefficient. – David says Reinstate Monica Nov 26 '15 at 02:44
  • There is a discussion in the comments about this problem in MySQL. I suspect it will work here as well. Maybe interesting? [Running a Cron job continuously](http://stackoverflow.com/questions/32700321/running-a-cron-job-continuously#comment53244570_32700321) – Ryan Vincent Nov 26 '15 at 03:10

1 Answer

0

I would recommend using some thread-safe resource to maintain the set of documents already picked up. As your workers read a document, they try to record the document's _id in that resource. If it isn't already there, the worker should process the document; if it is, the worker should move on to the next document.

As for what this thread-safe resource may be, Mongo is actually a pretty good option. It has document-level atomicity, so you can just create a new collection of 'parsed docs'. Every time you are about to parse a doc, you insert its _id into that collection, and if the write result says you inserted 1 document then you know it's new.
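For example, a minimal sketch of that claim step in Python with pymongo (the database and collection names here are hypothetical) could rely on the unique index Mongo automatically keeps on _id:

```python
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

db = MongoClient()["mydb"]      # hypothetical database name
claims = db["parsed_docs"]      # one document per claimed _id

def try_claim(doc_id):
    """Atomically claim a document; returns True only for the first caller."""
    try:
        # _id has a unique index, so only one insert of a given _id can succeed.
        claims.insert_one({"_id": doc_id})
        return True
    except DuplicateKeyError:
        # Another worker already inserted this _id and owns the document.
        return False
```

A worker then only processes a document when try_claim returns True.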

David says Reinstate Monica
  • What I am concerned about is the time gap between when I grab a document and when I mark it as being processed (whether by deleting it, updating a flag, or writing its _id to another collection): in that gap, another process could grab the same document as well. – sdfor Nov 29 '15 at 00:51
  • @sdfor If you don't start processing until after Mongo confirms a new document has been created, you won't have any issues with that. Don't update a flag in the document; create a new document with just the _id field. – David says Reinstate Monica Nov 29 '15 at 01:02
  • Gotcha - so the logic has to be: grab a document, write its _id to the id collection; if the write works (i.e. it is not a duplicate), process it; otherwise another process has it, so go on to the next document. And after a document has been processed, flag it so no process even tries it again - that's because Mongo doesn't have a join, and I can't say 'get the next document in the data collection that does not have its _id in the id collection'. – sdfor Nov 29 '15 at 03:13
  • @sdfor If you write the _ids to their own collection then you don't need any flags; the document itself is the flag. What you are looking for is the write result of the insert: if it says 1 document created, then it's new and should be processed. If it says 0 documents created, then another process has already begun work on this doc. – David says Reinstate Monica Nov 29 '15 at 03:19
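Putting the comment thread together, a sketch of a full worker loop under the same assumptions (hypothetical names, Python with pymongo) might look like this:

```python
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

def process(doc):
    ...  # placeholder for the real per-document work

db = MongoClient()["mydb"]
docs = db["documents"]      # the collection being processed
claims = db["parsed_docs"]  # _ids of documents some worker has claimed

def worker():
    # Skip documents already flagged as finished.
    for doc in docs.find({"processed": {"$ne": True}}):
        try:
            # Atomic claim: succeeds for exactly one worker per _id.
            claims.insert_one({"_id": doc["_id"]})
        except DuplicateKeyError:
            continue  # another worker is already handling this one
        process(doc)
        # Flag the source document so later scans skip it.
        docs.update_one({"_id": doc["_id"]}, {"$set": {"processed": True}})
```

The processed flag is only an optimization for the scan; the insert into parsed_docs is what actually prevents two workers from working on the same document.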