30

In C#.NET, I like using HashSets because of their supposed O(1) time complexity for lookups. If I have a large set of data that is going to be queried, I often prefer using a HashSet to a List, since it has this time complexity.

What confuses me is the constructor for the HashSet, which takes IEqualityComparer as an argument:

http://msdn.microsoft.com/en-us/library/bb359100.aspx

In the link above, the remarks note that the "constructor is an O(1) operation," but if this is the case, I am curious if lookup is still O(1).

In particular, it seems to me that, if I were to write a Comparer to pass in to the constructor of a HashSet, whenever I perform a lookup, the Comparer code would have to be executed on every key to check to see if there was a match. This would not be O(1), but O(n).

Does the implementation internally construct a lookup table as elements are added to the collection?

In general, how might I ascertain information about complexity of .NET data structures?

Kirby
  • 3,649
  • 3
  • 26
  • 28
  • Just test it with varying input sizes and see if the lookup time scales or remains constant. Pretty sure that the documentation is correct however. – Ed S. Mar 21 '12 at 20:07
  • It is *still* a HashSet once the constructor is over. The source data-structure itself is not kept (e.g. there is no "proxy" in this case). Lookup is O(1) but insert is *amortized* O(1). –  Mar 21 '12 at 20:08
  • @Kirby That doesn't change. You could construct the HashSet from an IEnumerable or add the elements individually later: the only thing that *might* be different, which does not affect the [lookup] time complexity, is the capacity. –  Mar 21 '12 at 20:09
  • Just read a bit about how a hash table really works: http://en.wikipedia.org/wiki/Hash_table (a hash set is a simplified hash table) – The Nail Mar 21 '12 at 20:12

5 Answers5

30

A HashSet works via hashing (via IEqualityComparer.GetHashCode) the objects you insert and tosses the objects into buckets per the hash. The buckets themselves are stored in an array, hence the O(1) part.

For example (this is not necessarily exactly how the C# implementation works, it just gives a flavor) it takes the first character of the hash and throws everything with a hash starting with 1 into bucket 1. Hash of 2, bucket 2, and so on. Inside that bucket is another array of buckets that divvy up by the second character in the hash. So on for every character in the hash....

Now, when you look something up, it hashes it, and jumps thru the appropriate buckets. It has to do several array lookups (one for each character in the hash) but does not grow as a function of N, the number of objects you've added, hence the O(1) rating.

To your other question, here is a blog post with the complexity of a number of collections' operations: http://c-sharp-snippets.blogspot.com/2010/03/runtime-complexity-of-net-generic.html

Scott Stafford
  • 43,764
  • 28
  • 129
  • 177
  • I believe hashing into buckets occurs in case of collision – sll Mar 21 '12 at 20:09
  • 8
    @sll hashing into buckets always occurs; if there's no collision, the bucket holds one item. – phoog Mar 21 '12 at 20:11
  • 3
    Thank you, Scott. For some reason, your explanation was very clear to me, particular due to the bit about calling, "IEqualityComparer.GetHashCode." It makes a lot of sense, now. – Kirby Mar 21 '12 at 20:21
  • 6
    @Kirby the basic explanation here is correct, but the character-based hashing implementation is not. In fact, there's only one array indexing operation to retrieve the correct bucket for a hash code (no jumping); the index is calculated as `hashCode % buckets.Length`. – phoog Mar 21 '12 at 20:25
  • @phoog: Thank you, and I certainly concur. Do you know how the implementation chooses the number of buckets? And what it does exactly inside each bucket when there's more than one element in there? – Scott Stafford Mar 21 '12 at 20:35
  • 1
    Yes. There's a table of prime numbers, and, for a given required capacity, the list chooses the next larger number from the table and uses that instead (or, if the required capacity is higher than the largest number, there is a function to calculates the next highest prime number). There's an array of that (prime number) size; the array elements are nodes, each of which is the first node of a singly-linked list... – phoog Mar 21 '12 at 20:50
  • 2
    ... When an item is added to a non-empty bucket, there's a linear search on the list to see whether the item is already present, and, if not, the item is added to the end of the list. For retrieval, it's necessary to do a linear search on the list until a match is found. That's why good hash distribution is important. If your hash function is `return 0;` then your hash set will have the O(n) performance of a linked list, but will take up much more memory because of the huge array of buckets. – phoog Mar 21 '12 at 20:53
  • @ScottStafford: If you're curious about the exact details, Microsoft offers the source code for download at http://referencesource.microsoft.com/netframework.aspx . – Brian Mar 21 '12 at 21:56
  • maybe this is a question more to @Scott Stafford -- if i implementing the GetHashCode() method in IEqualityComparer like so: "return StringComparer.InvariantCultureIgnoreCase .GetHashCode(item.ID);" . My question is what happens if the Equals methods return false? does it mean (theoretically) you will have "duplicate" items in the hashset? – user384080 May 03 '14 at 01:35
18

if I were to write a Comparer to pass in to the constructor of a HashSet, whenever I perform a lookup, the Comparer code would have to be executed on every key to check to see if there was a match. This would not be O(1), but O(n).

Let's call the value you are searching for the "query" value.

Can you explain why you believe the comparer has to be executed on every key to see if it matches the query?

This belief is false. (Unless of course the hash code supplied by the comparer is the same for every key!) The search algorithm executes the equality comparer on every key whose hash code matches the query's hash code, modulo the number of buckets in the hash table. That's how hash tables get O(1) lookup time.

Does the implementation internally construct a lookup table as elements are added to the collection?

Yes.

In general, how might I ascertain information about complexity of .NET data structures?

Read the documentation.

Eric Lippert
  • 647,829
  • 179
  • 1,238
  • 2,067
  • 3
    To expand on "Read the documentation", in some places the documentation is a bit sparse. In that case for most of the framework assemblies you can simply read the source code(!), that Microsoft provides via the [Reference Source Program](http://referencesource.microsoft.com/). Of course anything not documented is potentially subject to change, but in many cases you can determine some facts that are not likely to change. – Kevin Cathcart Mar 21 '12 at 21:01
  • "Unless of course the hash code supplied by the comparer is the same for every key!".. so what happens if the same hashcode value returned and the item managed to be added in the hashset collection? – user384080 Apr 30 '14 at 06:43
  • @user384080: Then the stated belief is true. That's what "unless" means in that sentence. – Eric Lippert Apr 30 '14 at 07:21
  • maybe this is a question more to @Scott Stafford -- if i implementing the GetHashCode() method in IEqualityComparer like so: "return StringComparer.InvariantCultureIgnoreCase .GetHashCode(item.ID);" . My question is what happens if the Equals methods return false? does it mean (theoretically) you will have "duplicate" items in the hashset? – user384080 Apr 30 '14 at 23:46
  • @EricLippert let's assume I have an unsorted array and I would like to know if some values exists in it or not. I would usually put the array values into a Dictionary and look them up with O(1), is there any difference if I use HashSet? or is it the same complexity? – Gilad Nov 22 '18 at 09:12
  • 1
    @Gilad: Good question; hash sets are the same complexity as a dictionary. Or, another way to look at it is that a dictionary is a hash set that stores (key, value) pairs, but only uses the key to get the hash. – Eric Lippert Nov 22 '18 at 15:51
4

Actually the lookup time of a HashSet<T> isn't always O(1).

As others have already mentioned a HashSet uses IEqualityComparer<T>.GetHashCode().
Now consider a struct or object which always returns the same hash code x.

If you add n items to your HashSet there will be n items with the same hash in it (as long as the objects aren't equal).
So if you were to check if an element with the hash code x exists in your HashSet it will run equality checks for all objects with the hash code x to test wether the HashSet contains the element

nikstffrs
  • 344
  • 1
  • 7
3

It would depends on quality of hash function (GetHashCode()) your IEqualityComparer implementation provides. Ideal hash function should provide well-distributed random set of hash codes. These hash codes will be used as an index which allows mapping key to a value, so search for a value by key becomes more efficient especially when a key is a complex object/structure.

the Comparer code would have to be executed on every key to check to see if there was a match. This would not be O(1), but O(n).

This is not how hashtable works, this is some kind of straightforward bruteforce search. In case of hashtable you would have more intelligent approach which uses search by index (hash code).

sll
  • 61,540
  • 22
  • 104
  • 156
  • the OP is asking about `HashSet`, not `Hashtable` (and the implementation details are somewhat different). – phoog Mar 21 '12 at 20:12
  • Thanks for noting that, I'm not sure but want to make things clear, this is what I've found in [MSDN](http://msdn.microsoft.com/en-us/library/bb397727.aspx): `The HashSet(Of T) class is based on the model of mathematical sets and provides high-performance set operations similar to accessing the keys of the Dictionary(Of TKey, TValue) or Hashtable collections. In simple terms, the HashSet(Of T) class can be thought of as a Dictionary(Of TKey, TValue) collection without values.` – sll Mar 21 '12 at 20:17
  • 1
    that's true. `HashSet` and the `Dictionary` actually use the same internal class to handle the core logic. The non-generic Hashtable uses a different implementation, but the performance characteristics would be similar. Your description of the importance of the hash function applies to both (which I failed to note) so +1. – phoog Mar 21 '12 at 20:19
  • @phoog: thanks, you've forced me to read documentation and find something new about `HashSet`, this is always great – sll Mar 21 '12 at 20:27
1

Lookup is still O(1) if you pass an IEqualityComparer. The hash set still uses the same logic as if you don't pass an IEqualityComparer; it just uses the IEqualityComparer's implementations of GetHashCode and Equals instead of the instance methods of System.Object (or the overrides provided by the object in question).

phoog
  • 42,068
  • 6
  • 79
  • 117