I'm pulling data from Amazon Mechanical Turk and saving it in a MongoDB collection.
I have multiple workers repeat each task, since a little redundancy helps me check the quality of the work.
Every time I pull data from Amazon using the boto AWS Python interface, I obtain a file containing all the completed HITs, and I want to insert them into the collection.
Here is the document I want to insert into the collection:
    mongo_doc = {
        'subj_id': data['subj_id'],
        'img_id': trial['img_id'],
        'data_list': trial['data_list'],
        'worker_id': worker_id,
        'worker_exp': worker_exp,
        'assignment_id': ass_id,
    }
img_id is an identifier of an image from a database of images.
subj_id is an identifier of a person in that image (there might be multiple per image).
data_list is the data I obtain from the AMT workers.
worker_id, worker_exp, and assignment_id are variables about the AMT worker and assignment.
Successive pulls using boto will contain the same data, but I don't want to have duplicate documents in my collection.
I am aware of two possible solutions, but neither works exactly for me:
1. I could search for the document in the collection and insert it only if it is not present. But this would have a very high computational cost.
2. I could use upsert to make sure that a document is inserted only if a certain key is not already contained. But every individual key can be duplicated, since the task is repeated by multiple workers.
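To make the two options concrete, here is a minimal sketch of what I mean. It assumes `collection` is a pymongo collection (anything exposing `find_one`, `insert_one`, and `update_one` the same way would do); the helper names `insert_if_absent` and `upsert_on_key` are made up for illustration and are not part of my actual code.

```python
def insert_if_absent(collection, mongo_doc):
    """Option 1: query for the full document, insert only when nothing matches."""
    # One extra lookup per document; without a suitable index this is
    # a collection scan each time, which is the cost I am worried about.
    if collection.find_one(mongo_doc) is None:
        # Insert a copy so the caller's dict is not modified by the insert.
        collection.insert_one(dict(mongo_doc))
        return True
    return False


def upsert_on_key(collection, mongo_doc, key):
    """Option 2: upsert keyed on a single field (hypothetical choice of key)."""
    # The filter carries the key; the remaining fields are written only
    # when the upsert actually creates a new document.
    rest = {k: v for k, v in mongo_doc.items() if k != key}
    collection.update_one(
        {key: mongo_doc[key]},
        {'$setOnInsert': rest},
        upsert=True,
    )
```

With option 2, whichever single `key` I choose, a second legitimate document that shares that value (another worker on the same image, or the same worker on another image) is silently skipped.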
NOTE on part 2:
- subj_id, img_id, and data_list can be duplicated, since different workers annotate the same subject and image and could give the same data.
- worker_id, worker_exp, and assignment_id can be duplicated, since a worker annotates multiple images within the same assignment.
- The only unique thing is the combination of all these fields.
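As an illustration of that last point, here is a small sketch with made-up values (the IDs and data below are hypothetical, not real AMT output). It assumes every field is JSON-serializable and uses a helper I am calling `combined_key` just for this example.

```python
import json

FIELDS = ('subj_id', 'img_id', 'data_list',
          'worker_id', 'worker_exp', 'assignment_id')


def combined_key(doc):
    """Serialize all six fields, in a fixed order, into one comparable string."""
    return json.dumps([doc[f] for f in FIELDS], sort_keys=True)


# Hypothetical documents, for illustration only.
first = {'subj_id': 1, 'img_id': 10, 'data_list': [[0.1, 0.2]],
         'worker_id': 'W_A', 'worker_exp': 5, 'assignment_id': 'ASG_1'}
# A different worker giving the same data for the same subject and image.
second = dict(first, worker_id='W_B', worker_exp=2, assignment_id='ASG_2')
# The same worker and assignment, but a different subject and image.
third = dict(first, subj_id=2, img_id=11)
# The same document arriving again in a later pull.
repeat = dict(first)

distinct = len({combined_key(d) for d in (first, second, third)})
is_duplicate = combined_key(repeat) == combined_key(first)
```

Here `first`, `second`, and `third` each share individual field values with one another, yet their combined keys all differ; only `repeat` produces the same key as `first`.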
Is there a way I can insert mongo_doc only if it was not inserted previously?