
Given (Simplified description)

One of our services keeps a lot of instances in memory. About 85% of them are unique. We need very fast key-based access to these items, as they are queried very often within a single stack / call. This single context is extremely performance-optimized.

So we started to put them into a dictionary. The performance was OK.

Accessing the items as fast as possible is the most important thing in this case. It is ensured that no write operations happen while reads occur.

Problem

In the meantime we have hit the limit on the number of items a dictionary can store.

Die Arraydimensionen haben den unterstützten Bereich überschritten. 
  bei System.Collections.Generic.Dictionary`2.Resize(Int32 newSize, Boolean forceNewHashCodes)
  bei System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)

Which translates to: "The array dimensions have exceeded the supported range."

Solutions like Memcached are just too slow in this specific case. It is an isolated, very specific use case encapsulated in a single service.

So we are looking for a replacement for the dictionary in this specific scenario.

Currently I can't find one supporting this. Am I missing something? Can someone point me to one?

As an alternative, if none exists, we are thinking about implementing one ourselves.

We thought about two possibilities: building it from scratch, or wrapping multiple dictionaries.

Wrapping multiple dictionaries

When an item is looked up, we could take the key's HashCode and use its leading bits as an index into a list of wrapped dictionaries, as in the sketch below. Although this seems easy, it smells to me, and it would mean that the hash code is calculated twice (once by us, once by the inner dictionary), and this scenario is really, really performance-critical.
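
To make the idea concrete, here is a minimal sketch of such a wrapper (not our actual code): the shard count, the key/value types and the use of `GetHashCode` modulo the shard count are assumptions for illustration only.

```
using System.Collections.Generic;

// Sketch only: key/value types, shard count and the hash-based routing are
// illustrative assumptions, not the service's real types.
public sealed class ShardedDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue>[] _shards;

    public ShardedDictionary(int shardCount)
    {
        _shards = new Dictionary<TKey, TValue>[shardCount];
        for (int i = 0; i < shardCount; i++)
            _shards[i] = new Dictionary<TKey, TValue>();
    }

    // The hash is computed once here to pick the shard and once more by the
    // inner Dictionary -- the double computation mentioned above.
    private Dictionary<TKey, TValue> ShardFor(TKey key)
    {
        int index = (key.GetHashCode() & 0x7FFFFFFF) % _shards.Length;
        return _shards[index];
    }

    public void Add(TKey key, TValue value) => ShardFor(key).Add(key, value);

    public bool TryGetValue(TKey key, out TValue value) =>
        ShardFor(key).TryGetValue(key, out value);
}
```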

I know that exchanging a base type like the dictionary is the absolute last resort and I want to avoid it. But currently it looks like there is no way to make the objects more unique, to get dictionary-level performance out of a database, or to save performance somewhere else.

I'm also aware of "beware of premature optimization", but lower performance would hit the business requirements behind this very badly.

Kirill Rakhman
Boas Enkler
  • What was the limit you reached? 2^31? – Matthew Watson Feb 25 '16 at 08:39
  • I'm not sure whether it's the count or the object size of the elements; I'm currently adding some logging code for this. But due to the circumstances of the service I can't get the results very quickly. – Boas Enkler Feb 25 '16 at 08:40
  • Also, do you control the implementation of the types you are adding to the dictionary? If so, you could at least cache the hash code so that it isn't recomputed unnecessarily (see the sketch after these comments). – Matthew Watson Feb 25 '16 at 08:41
  • Yes, I have control over them. Looking at caching, I must double-check whether the instances are immutable. – Boas Enkler Feb 25 '16 at 08:43
  • Dictionary already uses HashCodes to split up the entries into multiple independent buckets. – Damien_The_Unbeliever Feb 25 '16 at 08:44
  • Would an in-memory SQLite database be too slow? You can read more about it [here](https://www.sqlite.org/inmemorydb.html). – Steffen Winkler Feb 25 '16 at 08:45
  • How do you know the service isn't slowing down the application? How large are the queries into the dictionary, and how often do the queries occur? – jdweng Feb 25 '16 at 08:45
  • @SteffenWinkler Well, the database would be much slower than the dictionary, and the performance of the dictionary was just "ok", not good. – Boas Enkler Feb 25 '16 at 08:47
  • @jdweng I don't understand your question. The queries are similar to how a graph database works: you walk over items, find their refs and so on. But this is a very, very simplified description. – Boas Enkler Feb 25 '16 at 08:49
  • https://msdn.microsoft.com/en-us/library/hh285054%28v=vs.110%29.aspx – Hans Passant Feb 25 '16 at 08:49
  • @BoasEnkler Oh, okay. In that case I'd seriously think about rewriting that 'look-up' and 'store' part of the application and encapsulating it in a separate assembly that is written in a language better suited for such performance-critical things. – Steffen Winkler Feb 25 '16 at 08:50
  • @Steffen Winkler, I am also trying to find a solution for a situation similar to what is mentioned here. Can you please suggest the "language that is better suited for such performance-critical things"? – Krishna Kumar N Feb 25 '16 at 08:55
  • @KrishnaKumarN Ah sorry, I've no particular one in mind. But I guess rewriting that in C/C++ would already yield a performance improvement. You could take a look at [this](https://attractivechaos.github.io/plb/) or similar sites. – Steffen Winkler Feb 25 '16 at 09:11
  • I really think that if you're trying to store more than _two billion_ items in a `Dictionary`, you need to step back and consider whether you really need that many elements, or whether an alternative data store would better fit your needs. At the very least, the overhead from the `Array.Resize()` operations whenever the Dictionary needs to grow is not going to be insignificant. – Ian Kemp Feb 25 '16 at 09:16
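
As a side note on Matthew Watson's caching suggestion above, a key type with a cached hash code could look roughly like this; the type name, its field and the immutability assumption are hypothetical, not taken from the question.

```
// Hypothetical key type; the field and its immutability are assumptions.
public sealed class ItemKey
{
    private readonly string _id;  // assumed immutable key data
    private readonly int _hash;   // computed once, reused on every lookup

    public ItemKey(string id)
    {
        _id = id;
        _hash = id.GetHashCode();
    }

    public override int GetHashCode() => _hash;

    public override bool Equals(object obj)
    {
        var other = obj as ItemKey;
        return other != null && other._id == _id;
    }
}
```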

2 Answers


Before I finished reading your question, the simple multiple-dictionaries approach came to my mind. But you know this solution already. I am assuming you are really hitting the maximum number of items in a dictionary, not some other limit.

I would say go for it. I do not think you should be worried about computing a hash twice. If the keys are somehow long and getting the hash is really a time-consuming operation (which I doubt, but can't be sure as you did not mention what the keys are), you do not need to use the whole key for your hash function. Just pick whatever part you are OK with processing in your own hashing and distribute the items based on that.

The only thing you need to make sure of here is an even spread of items among your multiple dictionaries. How hard this is to achieve really depends on what your keys are. If they were completely random numbers, you could just use the first byte and it would be fine (unless you needed more than 256 dictionaries). If they are not random numbers, you have to think about the distribution over their domain and code your first hash function in a way that achieves the goal of even distribution, as in the rough sketch below.
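
A rough illustration of the first-byte idea, assuming string keys and 256 shards (neither of which is stated in the question):

```
using System.Collections.Generic;

public static class FirstByteSharding
{
    // Hypothetical first-level hash: picks one of 256 dictionaries from the
    // first character of a string key. Spreads evenly only if keys start randomly.
    public static int ShardIndex(string key)
    {
        return string.IsNullOrEmpty(key) ? 0 : key[0] % 256;
    }

    // Creates the 256 inner dictionaries that the index above points into.
    public static Dictionary<string, TValue>[] CreateShards<TValue>()
    {
        var shards = new Dictionary<string, TValue>[256];
        for (int i = 0; i < 256; i++)
            shards[i] = new Dictionary<string, TValue>();
        return shards;
    }
}
```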

Wapac

I've looked at the implementation of the .NET Dictionary and it seems like you should be able to store 2^32 values in your dictionary. (Next to the list of buckets, which are themselves linked lists, there is a single array that stores all items, probably for quick iteration; that might be the limiting factor.)

If you haven't added 2^32 values, it might be that there is a limit on the number of items in a bucket (it's a linked list, so it's probably limited by the maximum stack-frame size). In that case you should double-check that your hashing function spreads the items evenly over the dictionary. See this answer for more info: What is the best algorithm for an overridden System.Object.GetHashCode? A rough sketch of such a hash function follows below.
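
For illustration, a hash function in the style of the linked answer (prime multiply-and-add over the key's fields) could look like this; the key type and its fields are hypothetical.

```
// Hypothetical composite key; field names are made up for the example.
public sealed class CompositeKey
{
    private readonly int _partA;
    private readonly string _partB;

    public CompositeKey(int partA, string partB)
    {
        _partA = partA;
        _partB = partB;
    }

    // Prime multiply-and-add over the fields, in the style of the linked answer.
    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + _partA;
            hash = hash * 31 + (_partB != null ? _partB.GetHashCode() : 0);
            return hash;
        }
    }

    public override bool Equals(object obj)
    {
        var other = obj as CompositeKey;
        return other != null && other._partA == _partA && other._partB == _partB;
    }
}
```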

Roy T.