35

In mongodb there are multiple types of index. For this question I'm interested in the ascending (or descending) index which can be used for sorting and the hash index which according to the documentation is "primarily used with sharded clusters to support hashed shard keys" (source) ensuring "a more even distribution of data"(source)

I know that you can't create an index like: db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) because you get an error

{
    "createdCollectionAutomatically" : true,
    "numIndexesBefore" : 1,
    "errmsg" : "exception: Currently only single field hashed index supported.",
    "code" : 16763,
    "ok" : 0
}

My question:

Between the indices:

  1. db.test.ensureIndex( { "key": 1 } )

  2. db.test.ensureIndex( { "key": "hashed" } )

For the query db.products.find( { key: "a" } ), which one is more performant?, is the hashed key O(1)


How I got to the question:

Before I knew that you could not have multi-key indices with hashed, I created an index of the form db.test.ensureIndex( { "key": 1, "sortOrder": 1 } ), and while creating it I wondered if the hashed index was more performant than the ascending one (hash usually is O(1)). I left the key as it is now because (as I mentioned above) db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) was not allowed. But the question of is the hashed index faster for searches by a key stayed in my mind.

The situation in which I made the index was:

I had a collection that contained a sorted list of documents classified by keys.

e.g. {key: a, sortOrder: 1, ...}, {key: a, sortOrder: 2, ...}, {key: a, sortOrder: 3, ...}, {key: b, sortOrder: 1, ...}, {key: b, sortOrder: 2, ...}, ...

Since I used the key to classify and the sortOrder for pagination, I always queried filtering with one value for the key and using the sortOrder for the order of the documents.

That means that I had two possible queries:

  • For the first page db.products.find( { key: "a" } ).limit(10).sort({"sortOrder", 1})
  • And for the other pages db.products.find( { key: "a" , sortOrder: { $gt: 10 } } ).limit(10).sort({"sortOrder", 1})

In this specific scenario, searching with O(1) for the key and O(log(n)) for the sortOrder would have been ideal, but that wasn't allowed.

J-Rou
  • 2,216
  • 8
  • 32
  • 39
  • Thinking more about this, I'm not sure if having the hash in the key wold really be faster than a binary tree. I'm saying this because log2(20.000.000) ~= 25 and I don't know if a good hashing function is going to be much faster than checking less than 30 pointers. (In my case I won't go above 20MM keys by much) – J-Rou Feb 04 '15 at 20:50
  • 1
    If your app needs insert and delete often then probably hash index will be best – Robertiano Feb 05 '15 at 11:41
  • 4
    I believe, and I will check on this and update if I am wrong, that a hashed index is a disguised Btree index. The Btree keys are hashes instead of field values. Therefore, there's no `O(1)` vs. `O(log n)` asymptotic performance victory for hashed indexes, since they're actually Btrees storing hashes. The main point of a hashed index in MongoDB is to uniformly distribute key values, so that when a hashed index on `_id` is used as shard key you get writes evenly distributed among shards. – wdberkeley Feb 05 '15 at 15:46
  • @Robertiano Inserts are not that common, the most common operations are the two queries I posted. – J-Rou Feb 05 '15 at 16:16
  • @wdberkeley I knew that the implementation of the hashed index could be like that. The reason I wrote "usually" in `(hash usually is O(1))` is exactly that. Please let me know if you are wrong. – J-Rou Feb 05 '15 at 16:20
  • @wdberkeley if you post that as an answer I'll accept it. – J-Rou Feb 05 '15 at 16:31
  • As I understand from my limited experience, the logic of O(1) or O(log n) is not particularly relevant as it is in dealing with data structure sorting / searching. The hash is calculated based on the specific "shard" key and the result determines what shard a particular document belongs to. The caveat here is, performance will depend on the distribution of data among shards and the distribution depends on the kind of value the shard key possesses, in other words cardinality. The performance here mean how quickly the shard is located and not document for specific operation. –  Aug 03 '15 at 22:48
  • @satarangi-re This is an old question and when writing it, my intent was not to get into a sharding related discussion. The conclusion of the question was that the difference was insignificant. Now, since you wrote about sharding, I feel like I should explain better. When talking about shards, the hashed index hashes the index value trying to force a good, completely random distribution of the documents between the shards. If you use a hashed key, the `kind of value the shard key possesses` should not be important, the distribution should be good and random. – J-Rou Aug 12 '15 at 14:55
  • May I ask if it is worth to insert a random hash or value with the document, and use that for sharding instead of a hash generated on the _id ? – Daniel W. Feb 27 '17 at 13:37
  • Anyone? Any good answer? – Dima Tisnek Mar 06 '17 at 13:49

2 Answers2

19

For the query db.products.find( { key: "a" } ), which one is more performant?

Given that field key is indexed in both cases, the complexity index search itself would be very similar. As the value of a would be hashed, and stored in the index tree.

If we're looking for the overal performance cost, the hashed version would incur an extra (negligible) cost of hashing the value of a before matching the value in the index tree. See also mongo/db/index/hash_access_method.h

Also, hashed index would not be able to utilise index prefix compression (WiredTiger). Index prefix compression is especially effective for some data sets, like those with low cardinality (eg, country), or those with repeating values, like phone numbers, social security codes, and geo-coordinates. It is especially effective for compound indexes, where the first field is repeated with all the unique values of second field.

Any reason not to use hash in a non-ordered field?

Generally there is no reason to hash a non-range value. To choose a shard key, consider the cardinality, frequency, and rate of change of the value.

Hashed index is commonly used for a specific case of sharding. When a shard key value is a monotonically increasing/decreasing value, the distribution of data would likely to go into one shard only. This is where a hashed shard key would be able to improve the distribution of writes. It's a minor trade-off to greatly improve your sharding cluster. See also Hashed vs Ranged Sharding.

is it worth to insert a random hash or value with the document, and use that for sharding instead of a hash generated on the _id ?

Whether it's worth it, depends on the use case. A custom hash value would mean that any query for the hash value would have to go through a custom hashing code i.e. application.

The advantage for utilising the built-in hash function is that MongoDB automatically computes the hashes when resolving queries using hashed indexes. Therefore, applications do not need to compute hashes.

Wan B.
  • 18,367
  • 4
  • 54
  • 71
2

In a specific type of usage the index will be smaller!

Yes! In a very specific scenario where all three of the following conditions are satisfied.

  • Your access pattern (how you search) must be only to find documents with a specific value for the indexed field (key-value lookup, e.g., finding a product by the SKU, or finding a user by their ID, etc.)
  • You don't need range based queries or sorting for the indexed field.
  • Your field is a very large string and Mongo's numerical hash of the field is smaller than the original field.

For example, I created two indexes, and for the hashed version, the size of the index was smaller. This can result in better memory and disk utilization.

// The type of data in the collection. Each document is a random string with 65 characters.
{
  "myLargeRandomString": "40a9da87c3e22fe5c47392b0209f296529c01cea3fa35dc3ba2f3d04f1613f8e"
}

The index is about 1/4 of the normal version!

mongos> use MyDb
mongos> db.myCollection.stats()["indexSizes"]
{
    // A regular index. This one is sorted by the value of myLargeRandomString
    "myLargeRandomString_-1"     : 23074062336,

    // The hashed version of the index for the same field. It is around 1/4 of the original size.
    "myLargeRandomString_hashed" : 6557511680,
}

NOTE:

If you're already using _id as the foreign key for your documents, then this is not relevant since collections will have an _id index by default. As always, do your own testing of your data to check if this change will actually benefit you. There is a significant tradeoff in terms of search capabilities on this type of index.

robert
  • 1,402
  • 1
  • 15
  • 21