4

We all know that when we want to find one item in a collection based on some key value, Dictionary/HashSet etc. are the fastest options available in C#. But obviously they rely on setting up buckets and calling a hash function on any key value used as an argument for a lookup - both of which have some overhead.

So by rights this must mean that, for a collection up to a certain size, looping over each item in a list/array looking for a match by "brute force" must be quicker (i.e. the List.Contains method).

There's an article at http://www.dotnetperls.com/dictionary-time that suggests this threshold is three items. Frankly, I'm surprised that a Dictionary performs better with so few items!

I'm curious as to whether any of you out there have done your own benchmarks and can verify this. I'm also curious about the time required to instantiate the Dictionary and List, which the article above has left out (and frankly, in most of the insert-light/read-heavy situations we'd use a dictionary for, it's probably irrelevant - but in some cases this could be an important factor in deciding which to use).

Also: if this is the case (and a Dictionary really is a better choice than a List with four or more values) then why is it so? The example benchmarked in the article uses string keys - is there a much bigger performance cost to the default string equality operator/IEquatable implementation than I realise? Does a Dictionary always call on the key's IEquatable implementation during a lookup - or only in the case of a hash collision?

And finally: would this threshold of three items be much different if the type of the key were something with a simpler equality test (like an Int32/Int64/Guid)?
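To make the question concrete, here's the rough shape of the micro-benchmark I have in mind (the class name, sizes and iteration count are just placeholders, and Stopwatch timings like this are only indicative - a serious test would need warm-up and repeated runs):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class SmallLookupBenchmark
{
    static void Main()
    {
        const int iterations = 10000000;
        bool sink = false;   // keeps the lookups from being optimised away entirely

        foreach (int size in new[] { 2, 3, 4, 8, 16 })
        {
            // Build a List<string> and a Dictionary<string, string> over the same keys.
            string[] keys = Enumerable.Range(0, size).Select(i => "key" + i).ToArray();
            var list = new List<string>(keys);
            var dict = keys.ToDictionary(k => k, k => k);
            string probe = keys[size - 1];   // worst case for the list: the last element

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++) sink |= list.Contains(probe);
            long listMs = sw.ElapsedMilliseconds;

            sw.Restart();
            for (int i = 0; i < iterations; i++) sink |= dict.ContainsKey(probe);
            long dictMs = sw.ElapsedMilliseconds;

            Console.WriteLine("n={0,2}: List.Contains {1} ms, Dictionary.ContainsKey {2} ms",
                size, listMs, dictMs);
        }

        Console.WriteLine(sink);
    }
}
```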

Daniel Scott
  • Have YOU done any bench-marking? Yes, a dictionary always has to call Equals to confirm, since multiple strings can (and will) hash to the same value. A Dictionary can REJECT many mismatches based on the hash alone, which could be why it wins out fairly quickly, but this also depends on the access pattern. E.g. if the most sought-after items are towards the beginning of a list, the list search could win. – 500 - Internal Server Error Mar 04 '14 at 01:56
    Hi "500" - no I've not done any benchmarking myself. Though it might seem a bit back-to-front I thought I might ask if someone here had done some pretty thorough benchmarks of their own (and have the discussion out here in the wild) before diving in myself. If noone has bothered, and some of you think it would be worthwhile then I'll cook up a few tests. – Daniel Scott Mar 04 '14 at 02:03
  • possible duplicate of [HashSet vs. List performance](http://stackoverflow.com/questions/150750/hashset-vs-list-performance) – nawfal Jun 12 '14 at 10:41

2 Answers

2

The ListDictionary class is provided for the very reason you mention, and its documentation suggests:

Recommended for collections that typically include fewer than 10 items.

Microsoft also provides a HybridDictionary to let you get the best of both worlds. Its documentation describes its typical usage as follows:

This class is recommended for cases where the number of elements in a dictionary is unknown. It takes advantage of the improved performance of a ListDictionary with small collections, and offers the flexibility of switching to a Hashtable which handles larger collections better than ListDictionary.

As for your specific case, the only way to see which performs best is to benchmark.

(Note that the examples above are for information purposes only! You will generally be much better off using the new .NET generic collections...)
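For information only, here's a rough sketch of what using a HybridDictionary looks like (the colour values are just made-up sample data); note the non-generic API, where keys and values are typed as object:

```csharp
using System;
using System.Collections.Specialized;

class HybridDictionaryDemo
{
    static void Main()
    {
        // Behaves like a ListDictionary while small, and switches to a Hashtable
        // internally once the collection grows.
        var lookup = new HybridDictionary();

        lookup["red"] = 0xFF0000;     // values are stored as object, so ints get boxed
        lookup["green"] = 0x00FF00;
        lookup["blue"] = 0x0000FF;

        if (lookup.Contains("green"))
        {
            int value = (int)lookup["green"];   // and cast (unboxed) on the way out
            Console.WriteLine("green = #{0:X6}", value);
        }
    }
}
```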

Baldrick
  • Hi Baldrick (fine nickname btw): since both ListDictionary and HybridDictionary are from the .NET 1.1 days before generics, surely the cost of boxing when using reference type keys/values negates a good part of any performance improvement they might offer? This is of course a good question in itself ... do any of you out there use these collections in this (post .NET 2.0) era? – Daniel Scott Mar 04 '14 at 02:27
  • @Daniel Scott: Thanks! I'm not exactly recommending you use them - I was just giving evidence from MS of the kind of size where the dictionary approach tends to outperform lists (i.e. around 10), and the approach they took to get a more general solution (building a HybridDictionary class). It's worth mentioning that the old collections only *box/unbox* with value types - for reference types, you're just *casting*. – Baldrick Mar 04 '14 at 02:35
  • *ugh* - I must be having a bit of a dopey day today - I meant to say "when using value types" (not when using reference types). But yes, you're right - a ListDictionary of string key/some-reference-type value should perform pretty well. I guess it's a bit of dogma many of us follow, avoiding these old non-generic collections so universally. – Daniel Scott Mar 04 '14 at 02:49
  • @DanielScott: In general, avoiding the old collections is a good plan! :) I've added a note to that effect in my answer. – Baldrick Mar 04 '14 at 02:55
1

The reason the article probably doesn't go into detail about the cost of setting up the dictionary/list is that it's largely trivial. For that matter, if you're only going to do a single lookup in a data structure, it really doesn't matter how you implement it, because it will take a minuscule amount of time.

What we care about are the accesses because generally we are going to access a data structure a number of times and the effect of those repeated accesses will greatly outweigh any gains in setup time.

As for why a list is slower even with so few items: it's because that's not what lists are designed for. The key here is that computation is generally much faster than memory access. If you're looking for a specific thing in your data structure, having an algorithm that tells you where to look (a hash function) with minimal memory access speeds things up quite noticeably. If you need to access items sequentially, as is often the case with strings, then a list is what you want.
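To illustrate the idea, here's a toy bucketed lookup - emphatically not the real Dictionary<TKey,TValue> implementation, and TinyHashLookup is just a name invented for this sketch:

```csharp
using System;
using System.Collections.Generic;

// The hash code decides *where* to look, so only the few entries in one bucket
// are examined, and Equals only runs on candidates that landed in that bucket.
class TinyHashLookup
{
    private readonly List<string>[] _buckets;

    public TinyHashLookup(IEnumerable<string> keys, int bucketCount)
    {
        _buckets = new List<string>[bucketCount];
        for (int i = 0; i < bucketCount; i++)
            _buckets[i] = new List<string>();

        foreach (string key in keys)
            _buckets[Bucket(key)].Add(key);
    }

    public bool Contains(string key)
    {
        // One cheap computation picks the bucket...
        foreach (string candidate in _buckets[Bucket(key)])
            if (string.Equals(candidate, key))   // ...and only its entries get compared
                return true;
        return false;
    }

    private int Bucket(string key)
    {
        return (key.GetHashCode() & int.MaxValue) % _buckets.Length;
    }

    static void Main()
    {
        var lookup = new TinyHashLookup(new[] { "apple", "banana", "cherry" }, 8);
        Console.WriteLine(lookup.Contains("banana"));   // True
        Console.WriteLine(lookup.Contains("durian"));   // False
    }
}
```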

  • Hi Daniel, thanks for the reply. That's an interesting point (computation faster than data access). I would have thought though that the time to access each item (by index) in a list (which is a pretty minimal wrapper around an array) should be very quick - so then the overhead in iterating over a handful of list items should be pretty minimal. Am I wrong in that assumption? Or again, does this come down to data type? (since a string is a reference type, and getting a list item would involve following a memory address in the array)? – Daniel Scott Mar 04 '14 at 02:39
  • If it's a small list, accessing each item will indeed be pretty quick. But calculating a simple hash function and then doing a single access will still often be faster. When you're doing either over and over again, the difference will become noticeable. – Daniel Rabiega Mar 04 '14 at 14:15