
I want to cache 10,000+ key/value pairs (both strings) and started thinking about which .NET (2.0, bound to MS Studio 2005 :( ) structure would be best. All items will be added in one shot, then there will be a few hundred queries for particular keys.
I've read the MSDN descriptions referenced in the other question, but I still miss some details about the implementation / complexity of operations on the various collections. E.g. in the above-mentioned question, there is a quote from MSDN saying that SortedList is based on a tree and that SortedDictionary "has similar object model" but different complexity. The other question: are Hashtable and Dictionary implemented in the same way?
For Hashtable, they write:

If Count is less than the capacity of the Hashtable, this method is an O(1) operation. If the capacity needs to be increased to accommodate the new element, this method becomes an O(n) operation, where n is Count.

But when is the capacity increased? With every "Add"? Then adding a series of key/value pairs would have quadratic complexity, the same as with SortedList.

Not to mention OrderedDictionary, where nothing is said about implementation / complexity at all.

Maybe someone knows a good article about the implementation of the .NET collections?

  • I finally had time to look at the source code as hinted by Joe Cheng below - thanks again! - source code is the best documentation ;) Here is my summary: -- Hashtable: when enlarging the capacity (by a factor of 2), rehashing occurs, which costs ca. the current capacity but happens only log n times, so O(n) overall -- Dictionary: inner Hashtable object -- SortedList: two arrays for keys and values, capacity enlarged by a factor of 2, but inserting usually involves copying elements anyway, so O(n^2) altogether -- SortedDictionary: based on a tree, inserting is O(log n) – MkL Jul 06 '11 at 11:20

4 Answers


The capacity of the Hashtable is different from the Count.

The capacity -- the maximum number of items that can be stored, normally related to the number of underlying hash buckets -- doubles when a "grow" is required, although this is implementation-dependent. The Count simply refers to the number of items actually stored, which must be less than or equal to the capacity but is otherwise not related.

Because of the exponentially increasing interval between the O(n) resizes (n = Count), most hash implementations claim O(1) amortized access. The quote is just saying: "Hey! It's amortized, and the O(1) isn't always true!"
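To see why the doubling policy gives amortized O(1), here is a minimal C# sketch that simulates a doubling-growth container and counts element copies. It only illustrates the amortized argument, not the actual Hashtable internals (which also rehash); the starting capacity of 4 and the 10,000 adds are made-up numbers:

```csharp
using System;

class AmortizedGrowthDemo
{
    static void Main()
    {
        int capacity = 4;   // assumed starting capacity, for illustration
        int count = 0;      // items stored so far
        long copies = 0;    // total element copies caused by grows

        for (int i = 0; i < 10000; i++)
        {
            if (count == capacity)
            {
                copies += count; // a grow copies every existing element
                capacity *= 2;   // doubling policy
            }
            count++;
        }

        // Total copies stay below 2 * count (4 + 8 + 16 + ... < 2n),
        // so the average cost per Add is bounded by a constant.
        Console.WriteLine("Adds: {0}, copies: {1}", count, copies);
    }
}
```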

Happy coding.


If you are adding that many pairs, you can/should use the Dictionary(int capacity) constructor to specify the capacity in advance. Then every add and lookup will be O(1).
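A minimal C# 2.0 sketch of that approach (the capacity of 10000 and the key/value strings are just placeholders for the asker's data):

```csharp
using System;
using System.Collections.Generic;

class CacheDemo
{
    static void Main()
    {
        // Sizing the dictionary up front avoids the intermediate
        // grow-and-rehash steps while the cache is being filled.
        Dictionary<string, string> cache =
            new Dictionary<string, string>(10000);

        cache.Add("some-key", "some-value"); // O(1) on average

        string value;
        if (cache.TryGetValue("some-key", out value)) // O(1) lookup
        {
            Console.WriteLine(value);
        }
    }
}
```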

If you really want to see how these classes are implemented, you can look at the Rotor source or use .NET Reflector to look at System.Collections (not sure of the legality of the latter).

Joe Cheng
  • Thanks for the hints: the constructor and the opportunity to look at the source code - this is what I was missing. – MkL Jun 29 '11 at 00:22

Hashtable and Dictionary are implemented in the same way; Dictionary is the generic replacement for Hashtable.

When the capacity of collections like List and Dictionary has to increase, it grows at a certain rate. For List the rate is 2.0, i.e. the capacity is doubled. I don't know the exact rate for Dictionary, but it works the same way.

For a List, the way that the capacity is increased means that each item has been copied on average an extra 1.3 times. As that value stays constant as the list grows, the Add method is still an O(1) operation on average.
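The growth is easy to observe for List via its Capacity property; here is a small C# sketch (the exact starting capacity and growth sequence are implementation details, so the printed numbers are only what current implementations happen to show):

```csharp
using System;
using System.Collections.Generic;

class ListGrowthDemo
{
    static void Main()
    {
        List<int> list = new List<int>();
        int lastCapacity = list.Capacity;

        for (int i = 0; i < 100; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity)
            {
                // Typically prints capacities 4, 8, 16, 32, 64, 128:
                // each grow doubles the underlying array.
                Console.WriteLine("Count {0,3} -> Capacity {1}",
                                  list.Count, list.Capacity);
                lastCapacity = list.Capacity;
            }
        }
    }
}
```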

Guffa
  • Thanks. And what is the capacity of a Dictionary - the number of buckets in the hash table? How did you obtain 1.3? If the capacity of a list is exactly doubled, then each item should be copied 1 extra time on average because of the re-allocations. But anyway, Add is still O(1). – MkL Jun 28 '11 at 23:42
  • @MkL: The capacity of a dictionary is how many items it can hold, i.e. how large the arrays are that are used internally to store the items. The buckets don't exist as objects themselves; there is one array for the KeyValue items, and one array for the index of the next item in the same bucket. The number 1.3 is a result of how the items are copied. At any given time 33% to 100% of the items have been copied at least once, 1/3 of those have been copied at least twice, 1/3 of those at least thrice, and so on. – Guffa Jun 29 '11 at 06:28

Dictionary is a kind of hashtable; I never use the original Hashtable since it only holds "objects". Don't worry about the fact that insertion is O(N) when the capacity is increased; Dictionary always doubles the capacity when the hashtable is full, so the average (amortized) complexity is O(1).

You should almost never use SortedList (which is basically a sorted array), since complexity is O(N) for each insert or delete (assuming the data is not already sorted; if the data arrives sorted you get O(1) inserts, but then you don't need SortedList anyway, because an ordinary List would suffice). Instead of SortedList, use SortedDictionary, which offers O(log N) insert, delete, and search. However, SortedDictionary is slower than Dictionary, so use it only if your data needs to be sorted.

You say you want to cache 10,000 key-value pairs. If you want to do all the inserts before you do any queries, an efficient method is to create an unsorted List, then Sort it, and use BinarySearch for queries. This approach saves a lot of memory compared to using SortedDictionary, and it creates less work for the garbage collector.
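A minimal C# 2.0 sketch of that fill-then-sort approach (the KeyComparer helper, the capacity of 10000 and the sample pairs are illustrative, not part of any built-in API):

```csharp
using System;
using System.Collections.Generic;

class SortedListCacheDemo
{
    // Orders KeyValuePairs by key only, so we can sort the pairs
    // and later binary-search them by key.
    class KeyComparer : IComparer<KeyValuePair<string, string>>
    {
        public int Compare(KeyValuePair<string, string> x,
                           KeyValuePair<string, string> y)
        {
            return string.CompareOrdinal(x.Key, y.Key);
        }
    }

    static void Main()
    {
        KeyComparer comparer = new KeyComparer();
        List<KeyValuePair<string, string>> cache =
            new List<KeyValuePair<string, string>>(10000);

        // 1. Add everything in one shot; order does not matter yet.
        cache.Add(new KeyValuePair<string, string>("beta", "2"));
        cache.Add(new KeyValuePair<string, string>("alpha", "1"));

        // 2. Sort once: O(n log n).
        cache.Sort(comparer);

        // 3. Each query is an O(log n) binary search; only the key
        //    of the probe pair matters, so its value can be null.
        int index = cache.BinarySearch(
            new KeyValuePair<string, string>("alpha", null), comparer);
        if (index >= 0)
        {
            Console.WriteLine(cache[index].Value); // prints "1"
        }
    }
}
```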

Qwertie
  • Thanks for the info. Why is SortedDictionary slower than Dictionary? Is it based on a tree? The idea with the List: right. And the List initialized with Capacity = number of key/value pairs, as Joe Cheng hinted below. – MkL Jun 28 '11 at 23:14
  • @MkL Neither a SortedDictionary nor a SortedList uses a hash code / hashing algorithm -- this is why they are `O(lg n)`/`O(n lg n)`/`O(n)` for access/inserts. (I personally think the names chosen are awful :-) A Dictionary uses a hashing algorithm (although this is not a requirement of the IDictionary interface). As far as speed -- it may or may not be faster. Big-O talks about limits. The particular values of `N` and `C` need to be taken into account for *real world performance*. – Jun 29 '11 at 04:14
  • Not only is SortedDictionary slower in theory, I also benchmarked it: http://www.codeproject.com/KB/cross-platform/BenchmarkCppVsDotNet.aspx – Qwertie Jul 11 '11 at 16:29