
I'm pulling some data from Amazon Mechanical Turk and saving it in a mongodb collection.

I have multiple workers repeat each task as a little redundancy helps me check the quality of the work.

Every time I pull data from Amazon using the boto AWS Python interface, I obtain a file containing all the completed HITs and want to insert them into the collection.

Here is the document I want to insert into the collection:

    mongo_doc = {
        'subj_id'      : data['subj_id'],
        'img_id'       : trial['img_id'],
        'data_list'    : trial['data_list'],
        'worker_id'    : worker_id,
        'worker_exp'   : worker_exp,
        'assignment_id': ass_id
    }
  • img_id is an identifier of an image from a database of images.
  • subj_id is an identifier of a person in that image (there might be multiple per image).
  • data_list is the data I obtain from the AMT workers.
  • worker_id, worker_exp, assignment_id are variables about the AMT worker and assignment.

Successive pulls using boto will contain the same data, but I don't want to have duplicate documents in my collection.

I am aware of two possible solutions, but neither works exactly for me:

  1. I could search for the document in the collection and insert it only if not present. But this would have a very high computational cost.

  2. I can use upsert as a way to make sure that a document is inserted only if a certain key is not already contained. But all of the contained keys can be duplicated since the task is repeated by multiple workers.

NOTE on solution 2:

  • subj_id, img_id, data_list can be duplicated, since different workers annotate the same subject and image and could give the same data.
  • worker_id, worker_exp, assignment_id can be duplicated, since a worker annotates multiple images within the same assignment.
  • The only unique thing is the combination of all these fields.

Is there a way I can insert the mongo_doc only if it was not inserted previously?


1 Answer


As long as "all" you want to do here is "insert" items, then you have a couple of choices:

  1. Create a "unique" index across all the required fields and use insert. Simply put, when the combination of values is the same as something that already exists, a "duplicate key" error will be thrown. That stops the same thing being added twice and can alert you with an exception. This is possibly best used with the Bulk Operations API and the "unordered" flag for operations. The same "unordered" option is available for insert_many(), but I personally prefer the syntax of the Bulk API, as it allows better building and mixed operations (a fuller sketch of the index creation and unordered insert appears after this list):

    bulk = pymongo.bulk.BulkOperationBuilder(collection, ordered=False)
    bulk.insert(document)
    result = bulk.execute()
    

    If multiple operations were queued before .execute() is called, then all are sent to the server at once and there is only "one" response. With "unordered", all items are processed regardless of errors such as a "duplicate key", and the "result" contains a report of any failed items.

    The obvious "cost" here is that creating a "unique" index over all the fields will use a fair bit of space and add significant overhead to "write" operations, as the index information must be written as well as the data.

  2. Use "upsert" functionality with $setOnInsert. This allows you to construct a query with "all required unique fields" in order to "search" for the document and see if one exists. The standard "upsert" behaviour is that where the document is not found, a "new" document is created.

    What $setOnInsert adds is that all fields "set" within that statement are only applied where the "upsert" occurs. On a regular "match", all assignments inside the $setOnInsert are ignored:

    bulk = pymongo.bulk.BulkOperationBuilder(collection, ordered=True)
    bulk.find({
        "subj_id": data["subj_id"],
        "img_id": data["img_id"],
        "data_list": data["data_list"],
        "worker_id": data["worker_id"],
        "worker_exp": data["worker_exp"],
        "assignment_id": data["assignment_id"]
    }).upsert().update_one({
        "$setOnInsert": {
            # Just the "insert" fields, or just "data" as an object for all
            "subj_id": data["subj_id"],
            "img_id": data["img_id"],
            "data_list": data["data_list"],
            "worker_id": data["worker_id"],
            "worker_exp": data["worker_exp"],
            "assignment_id": data["assignment_id"]
        },
        "$set": {
            # Any other fields "if" you want to update on match.
            # Drop this $set block entirely if there is nothing to update,
            # since an empty $set is rejected by the server.
        }
    })
    result = bulk.execute()
    

    Depending on your needs, you can use $set or other operators for anything you "want" to update if the document is matched, or leave it out completely so that only "inserts" occur where there is no match.

    What you cannot do, of course, is something like assigning a value of 1 to a field inside $setOnInsert and then applying $inc to the same field elsewhere in the update. That produces a conflict where you are trying to modify the "same path" and will throw an error.

    In that case it is better to leave the $inc field "out" of the $setOnInsert block and just let it do its operation normally. On the first upsert an { "$inc": { "field": 1 } } will just set the field to 1 anyway. The same applies to $push and other operators.

    The "cost" is again asigning an index, that does not "need" to be "unique" but probably should be. Without an index the operations are "scanning the collection" for a possible match rather than the index which is more efficient. So it is not "required", but the cost of additional "writes" usually outweighs the cost of "lookup" in the case where an index was not specified.

The further advantage when coupled with "Bulk" operations is that since the "upsert" method with $setOnInsert does not throw any "duplicate key" error when all unique keys are in the query, this can be used with "ordered" writes for the batch as demonstrated.

When "ordered" in a batch of operations, the operations are processed in the "sequence" they were added, so if it is important to you that the "first" insert to happen is the one that is comitted then it is prefferable to "unordered", which while quicker to to parallel execution, is not of course guaranteed to commit the operations in the same order in which they were contructed.

Either way, there are costs to maintaining "unique" items over multiple keys. A possible alternative to "reduce" the index cost is to replace the _id field of your document with all the values you consider "unique".

Since that primary key is always "unique" and always "required", this minimizes the "cost" of writing "additional indexes" and may be an option to consider. The _id doesn't "need" to be an ObjectId, and since it can be a composite object, if you already have a unique combination of identifiers it is probably wise to use it that way and avoid maintaining a further unique index.
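
As a rough sketch of that composite _id idea (which fields to fold into the key is an assumption here; use whatever combination you actually consider unique, and data_list is left as an ordinary field):

    from pymongo.errors import DuplicateKeyError

    mongo_doc = {
        "_id": {                                  # composite primary key; field order matters
            "subj_id": data["subj_id"],
            "img_id": trial["img_id"],
            "worker_id": worker_id,
            "assignment_id": ass_id
        },
        "data_list": trial["data_list"],
        "worker_exp": worker_exp
    }

    try:
        collection.insert_one(mongo_doc)
    except DuplicateKeyError:
        pass    # this exact combination was already inserted on a previous pull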
