I am trying to build a big data app using MongoDB (coding in Java). My collection consists of ordinary text documents. Since I do not want duplicates, and the documents' text fields are too big to put a unique index on, I decided to calculate a checksum (MessageDigest with MD5) of each document's text, save it as a field in the document, and create a unique index on that field.

Roughly my document has a structure like:

{
"_id": ObjectId('5336b4942c1a99c94275e1e6'),
"textval": "some long text",
"checksum": "444066ed458746374238266cb9dcd20c",
"some_other_field": "qwertyuıop"
}
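
For reference, the checksum field could be computed roughly as follows. This is only a minimal sketch: the class name `ChecksumExample` and the method name `md5Hex` are made up for illustration, and UTF-8 encoding of the text is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; the names are illustrative only.
class ChecksumExample {

    // Returns the MD5 digest of the text as a 32-character lowercase hex string.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            // Assumption: the text is hashed as UTF-8 bytes.
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this is not expected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Prints the value that would be stored in the "checksum" field.
        System.out.println(md5Hex("some long text"));
    }
}
```

The resulting string is what goes into the "checksum" field and is covered by the unique index.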

So when I add a new document to my collection, I first check whether it already exists by looking for a document with that checksum value. If it exists I update its other fields; otherwise I insert the new document.

This strategy works! But once the collection passed one million documents, I started getting unacceptable insert durations. Both checksum lookups and inserts slowed down: I can only insert ~30,000 docs in almost an hour! I have read about bulk inserts but could not decide what to do with duplicate records if I go in that direction. Any recommendations on a strategy to speed things up?

salihcenap
  • Do you have a compound index on `checksum` and `update_time` and are you trying to do an update with the upsert option set to true? – Anand Jayabalan Apr 01 '14 at 11:23
  • Sorry, the information I gave was wrong. There is no update_time query, just checksum. I corrected the question. But there is an index on "textval". Could that be the reason for the slowness? – salihcenap Apr 01 '14 at 12:55

1 Answer

I think it would be much faster if you used another collection containing only the checksum and update_time fields. Then, whenever you insert your normal JSON document, you should insert this short JSON document as well:

Your normal JSON document:
{
"_id": ObjectId('5336b4942c1a99c94275e1e6'),
"textval": "some long text",
"checksum": "444066ed458746374238266cb9dcd20c",
"update_time": new Date(1396220136948),
"some_other_field": "qwertyuıop"
}

The short JSON document:
{
"_id": ...,
"checksum": "444066ed458746374238266cb9dcd20c",
"update_time": new Date(1396220136948)
}
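
To illustrate the idea, here is a minimal in-memory sketch in Java. It is not MongoDB driver code: the two `HashMap`s only stand in for the two collections, and the class and method names (`TwoCollectionSketch`, `save`, `documentCount`, `updateTime`) are hypothetical. The point is that the existence check touches only the small checksum structure, never the large text documents.

```java
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for the two-collection layout; all names are illustrative.
class TwoCollectionSketch {

    // Stand-in for the main collection: checksum -> full document.
    private final Map<String, Map<String, Object>> documents = new HashMap<>();

    // Stand-in for the short collection: checksum -> update_time.
    private final Map<String, Date> checksumIndex = new HashMap<>();

    // Returns true if a new document was inserted, false if an existing one was updated.
    boolean save(String checksum, String textval, String someOtherField, Date now) {
        // The duplicate check only consults the small structure.
        boolean isNew = !checksumIndex.containsKey(checksum);
        if (isNew) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("checksum", checksum);
            doc.put("textval", textval);
            doc.put("some_other_field", someOtherField);
            doc.put("update_time", now);
            documents.put(checksum, doc);
        } else {
            // Same text already stored: update only the other fields.
            Map<String, Object> doc = documents.get(checksum);
            doc.put("some_other_field", someOtherField);
            doc.put("update_time", now);
        }
        // Keep the short record in step with the full document.
        checksumIndex.put(checksum, now);
        return isNew;
    }

    int documentCount() {
        return documents.size();
    }

    Date updateTime(String checksum) {
        return checksumIndex.get(checksum);
    }
}
```

Calling `save` twice with the same checksum leaves a single full document and simply refreshes its `update_time` in both places.
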
Kalman