
Mongo Docs discuss the max index size.

Index Key
The total size of an indexed value must be less than 1024 bytes. 
MongoDB will not add that value to an index if it is longer than 1024 bytes.

Using db.collection.stats(), I can see that my average document size is 5 MB. If I'm indexing on a field that takes up 50% of the document, does that mean the index size would be 50% * 5 MB = 2.5 MB?

I'm confused as to how the index size is calculated for a single document.

Kevin Meredith

1 Answer


I'm not sure why you're trying to index such large fields, but as the documentation says, MongoDB will not index a field value longer than 1024 bytes. If you index a field that holds 2.5 MB, that value isn't really indexed; it's skipped.
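
As a rough illustration of that limit, here is a minimal sketch that checks whether a string value would fit. It assumes the 1024-byte figure quoted in the question and only measures the UTF-8 length of the string; the real index key carries some extra overhead, so treat it as an approximation. The names `MAX_INDEX_KEY_BYTES` and `fitsIndexKeyLimit` are made up for this example.

```javascript
// Approximate check, assuming the 1024-byte limit quoted in the question.
// The stored index key has some additional overhead, so this is only a rough guide.
const MAX_INDEX_KEY_BYTES = 1024;

// Hypothetical helper: true if the UTF-8 encoding of `value` is under the limit.
function fitsIndexKeyLimit(value) {
  return Buffer.byteLength(value, "utf8") < MAX_INDEX_KEY_BYTES;
}

console.log(fitsIndexKeyLimit("cheese"));           // true
console.log(fitsIndexKeyLimit("z".repeat(102500))); // false
```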

If you need to index really large field data, you'll need to represent it in a form that fits in under 1024 bytes. For example, you might compute a CRC32 of the value and index that instead. It won't be perfect, since different values can share the same checksum, but it might be "good enough".
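
A minimal sketch of that idea in Node.js follows. Note that it swaps CRC32 for SHA-256, simply because Node's built-in `crypto` module provides it and collisions are far less likely. The helper `withValueHash` and the `valueHash` field name are hypothetical; the index would then go on `valueHash` rather than on `value`.

```javascript
const crypto = require("crypto");

// Hypothetical helper: returns a copy of `doc` with a fixed-size digest of
// `doc.value` stored in a separate `valueHash` field (name chosen for this example).
function withValueHash(doc) {
  const valueHash = crypto
    .createHash("sha256")
    .update(doc.value, "utf8")
    .digest("hex");
  return { ...doc, valueHash };
}

// Same shape of data as the demo in this answer: a very long string plus a suffix.
const doc = withValueHash({ value: "z".repeat(102500) + "0" });
console.log(doc.valueHash.length); // 64 hex characters, whatever the input size
```

To look up a document, hash the value you are searching for the same way and query on the hash field.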

To show some of the oddities of indexing in this situation, I've put together a simple demo.

  1. Start with a fresh collection (test)
  2. Create an index on the value field
  3. Show the stats
  4. Create 1000 documents, each with a value field that is just over 102500 characters long and unique per document
  5. Show the stats again

Example:

> db.test.drop()
true
> db.test.ensureIndex({value:1})
> db.test.stats()
{
        "ns" : "test.test",
        "count" : 0,
        "size" : 0,
        "storageSize" : 8192,
        "numExtents" : 1,
        "nindexes" : 2,
        "lastExtentSize" : 8192,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 0,
        "totalIndexSize" : 16352,
        "indexSizes" : {
                "_id_" : 8176,
                "value_1" : 8176
        },
        "ok" : 1
}
> var data="";for(var i=0;i<102500;i++){ data+= "z";};for(var i=0;i<1000;i++){ db.test.insert({value: data + i.toString() })};
> db.test.stats()
{
        "ns" : "test.test",
        "count" : 1000,
        "size" : 106480000,
        "avgObjSize" : 106480,
        "storageSize" : 123248640,
        "numExtents" : 8,
        "nindexes" : 2,
        "lastExtentSize" : 37625856,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 0,
        "totalIndexSize" : 49056,
        "indexSizes" : {
                "_id_" : 40880,
                "value_1" : 8176
        },
        "ok" : 1
}

You'll see that the storage size (storageSize) has ballooned, but totalIndexSize remains small. The growth comes from the _id_ index; value_1 is still 8176 bytes, because the oversized values were skipped.
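
To make the comparison explicit, here is a small sketch that subtracts the two indexSizes objects from the stats output above. The numbers are copied from that output; the helper name `indexGrowth` is made up for this example.

```javascript
// indexSizes values copied from the two db.test.stats() outputs above.
const before = { _id_: 8176, value_1: 8176 };
const after = { _id_: 40880, value_1: 8176 };

// Hypothetical helper: per-index growth in bytes between two indexSizes objects.
function indexGrowth(beforeSizes, afterSizes) {
  const growth = {};
  for (const name of Object.keys(afterSizes)) {
    // An index missing from the earlier stats is treated as starting at 0 bytes.
    growth[name] = afterSizes[name] - (beforeSizes[name] || 0);
  }
  return growth;
}

console.log(indexGrowth(before, after)); // { _id_: 32704, value_1: 0 }
```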

You can also see details for a specific index using this technique (http://docs.mongodb.org/manual/faq/storage/#how-can-i-check-the-size-of-indexes).

You can see how the value index is small (size):

> db.test.$value_1.stats()
{
        "ns" : "test.test.$value_1",
        "count" : 1,
        "size" : 8176,
        "avgObjSize" : 8176,
        "storageSize" : 36864,
        "numExtents" : 1,
        "nindexes" : 0,
        "lastExtentSize" : 36864,
        "paddingFactor" : 1,
        "systemFlags" : 0,
        "userFlags" : 0,
        "totalIndexSize" : 0,
        "indexSizes" : {

        },
        "ok" : 1
}
WiredPrairie
  • If my documents look like: `{ _id : 1, favoriteFood : "cheese" }` and I indexed on `favoriteFood`, what would my index size be? – Kevin Meredith Oct 02 '13 at 20:52
  • It's going to vary a bit, since the index is stored in blocks of a B-tree and its size depends on the growth pattern of the index space. Also, as documents change, some of that space may go unused. – WiredPrairie Oct 02 '13 at 21:10
  • I added some more details with a technique for looking a bit deeper at an index. – WiredPrairie Oct 02 '13 at 21:15
  • Rather than CRC32 hashing to 'compress' an indexed field, a longer hash that avoids hash collisions should be used. SHA1 (160 bit) is likely sufficient, but SHA512 would further minimise hash collisions. Indeed, git identifies repository objects by their SHA1 hashes without problems. For more: 'A short note about SHA-1' in https://git-scm.com/book/en/v2/Git-Tools-Revision-Selection and also: http://stackoverflow.com/questions/4014090/is-it-safe-to-ignore-the-possibility-of-sha-collisions-in-practice – Richard EB Sep 09 '16 at 20:17