
I'm using a Dictionary<long, object> to store millions of entries. The keys are added in bulk, as runs of sequential numbers.

I remember some hashing algorithms having trouble with keys being added in sequence.

Is this the case for .Net?
If so, what are my options? (any neat lib?)

The data is fairly static once added. Would it be worth the effort to add the data through a randomizer?

PS I've already checked out:

Tedd Hansen

3 Answers


The performance of queries should be independent of the order keys are added to the hash table. Inserting elements is easily O(1) amortized via chaining, even in the presence of collisions.

Have you actually measured a performance problem? If not, don't bother making changes. If so, consider writing a class optimized for sequential indexes.
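
For illustration only, here is a minimal sketch of what a class optimized for sequential keys could look like, assuming the keys arrive in dense, contiguous runs; the SequentialStore name and the page size are placeholders of mine, not anything from the framework:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: a store tuned for keys that arrive in dense,
// contiguous runs. Keys are grouped into fixed-size pages, so a lookup is
// one small dictionary probe plus one array read.
public sealed class SequentialStore<TValue> where TValue : class
{
    private const int PageSize = 4096;  // items per page; assumed value, tune as needed
    private readonly Dictionary<long, TValue[]> _pages = new Dictionary<long, TValue[]>();

    public void Add(long key, TValue value)
    {
        TValue[] page;
        if (!_pages.TryGetValue(key / PageSize, out page))
        {
            page = new TValue[PageSize];
            _pages.Add(key / PageSize, page);
        }
        page[key % PageSize] = value;   // assumes non-negative keys
    }

    public bool TryGet(long key, out TValue value)
    {
        TValue[] page;
        if (_pages.TryGetValue(key / PageSize, out page))
        {
            value = page[key % PageSize];
            return value != null;       // a null slot means "not present"
        }
        value = null;
        return false;
    }
}
```

The idea is simply that consecutive keys share a page, so the per-item hashing cost is amortized over a whole run.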

Craig Gidney
    There are extreme cases where the performance actually depends on the order of insertions. Each bucket has its own linked list. If all items fell into the same bucket, the lookup of the last inserted item would be almost instantaneous, but it would be O(N) for the first inserted item. Of course, this would mean that you have a horrible hash function. Also, inserting is not guaranteed O(1) in the presence of collisions, because there is a check whether the item is already in the dictionary. That check is linear in the size of the bucket. – svick Jul 13 '11 at 20:48

Note: by “sequence”, I mean a sequence of numbers incrementing by one.

Actually, if the only keys added to the dictionary form a single sequence (without duplicates or gaps), that's the best possible situation. In the current implementation of .Net (which may change at any time, so you shouldn't depend on any of this), long.GetHashCode() for a sequence of numbers returns a sequence of numbers, and the bucket number is computed as the hash code modulo the capacity of the dictionary. That means that in this case, you are guaranteed to have no collisions.
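
For what it's worth, you can sketch this behaviour directly. The snippet below only mimics the computation described above (hash code reduced modulo the table size); the capacity is an arbitrary stand-in for the dictionary's internal prime bucket count, and the starting key is an example value:

```csharp
using System;
using System.Linq;

class HashSequenceDemo
{
    static void Main()
    {
        // Sequential long keys yield sequential hash codes, which land in
        // distinct buckets once reduced modulo the table size.
        const int capacity = 23;      // arbitrary stand-in for the dictionary's prime bucket count
        long start = 1000000;         // arbitrary starting key

        var buckets = Enumerable.Range(0, 10)
            .Select(i => (start + i).GetHashCode())        // for small positive longs this is the value itself
            .Select(hash => (hash & 0x7FFFFFFF) % capacity)
            .ToArray();

        Console.WriteLine(string.Join(", ", buckets));
        // Prints ten consecutive bucket indexes (wrapping around at 23): no collisions.
    }
}
```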

If you have multiple sequences of the same length, the worst case is that all of them collide and each used bucket contains one item from each sequence. This is not very likely, though. And in the average case, you'll get some collisions, but the average retrieval time will most likely still be O(1).

(There is one tiny lie in the above. For each crossing of the 32-bit boundary, the sequence of hashcodes for a sequence is going to have a gap of one number, because of the way long.GetHashCode() is implemented.)

svick

Dictionary can have a lot of overhead for that many items, and it relies on a good hash distribution for ideal performance.

You might want to run some benchmarks against other approaches. Would it be possible to simply allocate an array and use the key as the index? With an object[] indexed directly by the key, if the only possible key values are 0 to 1 million, the array would take less than 8MB and be much faster than a Dictionary.
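
As a rough sketch of that direct-index idea (the key bound and the demo values are assumptions, not anything from the question):

```csharp
using System;

class DirectIndexDemo
{
    // Assumption: keys are non-negative and below a known bound.
    const int MaxKey = 1000000;
    static readonly object[] ByKey = new object[MaxKey];

    static void Main()
    {
        long key = 42;
        ByKey[key] = "payload";       // insert: the key itself is the array index

        // Lookup is a range check plus one array read; no hashing at all.
        object value = (key >= 0 && key < MaxKey) ? ByKey[key] : null;
        Console.WriteLine(value);
    }
}
```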

If you can't do that directly, you could keep a lookup from each unique long to an int index: a dictionary that translates the long to a constantly increasing int, so that when a new long you haven't seen before comes in, it gets assigned the next location in the array.
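
One possible shape for that translation layer, as a sketch only (the class and member names are mine, and it assumes entries are only ever added or overwritten, never removed):

```csharp
using System.Collections.Generic;

// Sketch of the long -> int translation idea: each unseen long key is
// assigned the next slot in a growable list, and the dictionary only
// stores that mapping.
class KeyIndexedStore
{
    private readonly Dictionary<long, int> _indexOf = new Dictionary<long, int>();
    private readonly List<object> _values = new List<object>();

    public void Add(long key, object value)
    {
        int index;
        if (_indexOf.TryGetValue(key, out index))
        {
            _values[index] = value;           // key seen before: overwrite in place
        }
        else
        {
            _indexOf.Add(key, _values.Count); // new key: assign the next index
            _values.Add(value);
        }
    }

    public bool TryGet(long key, out object value)
    {
        int index;
        if (_indexOf.TryGetValue(key, out index))
        {
            value = _values[index];
            return true;
        }
        value = null;
        return false;
    }
}
```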

Or possibly use a more complex approach with jagged arrays, such as object[sequenceInt][uniqueIndexInt]. It really depends on how you will be accessing the data later.
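
One way that jagged layout could look, purely as a sketch; both dimensions below are assumed values rather than anything taken from the question:

```csharp
using System;

class JaggedLayoutDemo
{
    static void Main()
    {
        // Assumed shape: a small number of bulk loads ("sequences"), each one a
        // contiguous run of items addressed by their offset within the run.
        const int sequenceCount = 16;         // assumed number of bulk loads
        const int itemsPerSequence = 100000;  // assumed length of each run
        object[][] data = new object[sequenceCount][];

        int sequence = 3, offset = 1234;      // example coordinates
        if (data[sequence] == null)
            data[sequence] = new object[itemsPerSequence];   // allocate each run lazily
        data[sequence][offset] = "payload";

        Console.WriteLine(data[sequence][offset]);
    }
}
```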

BrandonAGr