I am building a big data app using MongoDB (coding in Java). My collection consists of ordinary text documents. Since I do not want duplicates, and the text field is too large to build a unique index on directly, I decided to compute a checksum (MessageDigest with MD5) of each document's text, store it as a field in the document, and create a unique index on that field.
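For reference, this is roughly how I compute the checksum and create the index. It is only a minimal sketch; the helper names and the use of the `MongoCollection` API from the 3.x+ Java driver are just for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class ChecksumUtil {

    // Hex-encoded MD5 of the document's text, used as the dedup key.
    static String md5Hex(String text) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Unique index so MongoDB itself rejects duplicate checksums.
    static void ensureChecksumIndex(MongoCollection<Document> collection) {
        collection.createIndex(Indexes.ascending("checksum"),
                new IndexOptions().unique(true));
    }
}
```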
Roughly, my documents have the following structure:
{ "_id": ObjectId('5336b4942c1a99c94275e1e6') "textval": "some long text" "checksum": "444066ed458746374238266cb9dcd20c" "some_other_field": "qwertyuıop" }
So when I add a new document to my collection, I first check whether it already exists by looking up a document with the same checksum. If it exists I update its other fields, otherwise I insert the new document. Roughly like the sketch below.
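This is a sketch of my current insert path, again assuming the modern driver's `Filters`/`Updates` helpers; variable names are illustrative:

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

static void saveDocument(MongoCollection<Document> collection,
                         String text, String checksum, String otherValue) {
    // Look up an existing document by its checksum first.
    Document existing = collection.find(Filters.eq("checksum", checksum)).first();
    if (existing != null) {
        // Duplicate text: only refresh the other fields.
        collection.updateOne(Filters.eq("_id", existing.getObjectId("_id")),
                Updates.set("some_other_field", otherValue));
    } else {
        // New text: insert a full document.
        collection.insertOne(new Document("textval", text)
                .append("checksum", checksum)
                .append("some_other_field", otherValue));
    }
}
```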
This strategy works! But after about one million documents in the collection, insert durations became unacceptable: both the checksum lookups and the inserts slowed down, to the point where inserting ~30,000 documents takes almost an hour. I have read about bulk inserts, but I could not decide how to handle duplicate records if I go in that direction (see the sketch below for what I mean). Any recommendations on a strategy to speed things up?
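For what it's worth, this is the kind of bulk pattern I was considering: an unordered bulk of upserts keyed on the checksum, so a duplicate becomes an update instead of failing the whole batch. This is only a sketch under my assumptions about the `bulkWrite`/`UpdateOneModel` API, not something I have benchmarked:

```java
import java.util.ArrayList;
import java.util.List;

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;

static void bulkUpsert(MongoCollection<Document> collection, List<Document> batch) {
    List<WriteModel<Document>> writes = new ArrayList<>();
    for (Document doc : batch) {
        writes.add(new UpdateOneModel<>(
                // Match on the unique checksum field.
                Filters.eq("checksum", doc.getString("checksum")),
                Updates.combine(
                        // Only write the big text field when the document is new.
                        Updates.setOnInsert("textval", doc.getString("textval")),
                        // Always refresh the other fields.
                        Updates.set("some_other_field", doc.getString("some_other_field"))),
                new UpdateOptions().upsert(true)));
    }
    // Unordered: one failed or duplicate write does not stop the rest of the batch.
    collection.bulkWrite(writes, new BulkWriteOptions().ordered(false));
}
```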