
I have recently been grilled in a couple of interviews about hash tables and when it is necessary to override GetHashCode(). The discussion kept going deeper and deeper until I threw in the towel.

I am now doing some research to cover everything to be ready for next time.

I have found this excellent article that I would like to share: http://msdn.microsoft.com/en-us/library/ms379571(VS.80).aspx#datastructures20_2_topic5

1) Something I don't feel entirely comfortable with is the fact that Dictionaries are hash based, but Lists apparently are not. Does that mean only that searching a List<> or an Array[] is linear, while searching a Dictionary or Hashtable is constant time and therefore much faster? Is that all there is to it?

2) If I use a class as a key in a dictionary, I need to override GetHashCode() on that class, based on whatever identification fields make the instances unique. However, it could still happen that both ID fields are equal and the same hash code would be generated. If that is the case, what happens when two instances with the same hash code collide?

3) How can a collision be resolved? The article mentions a rehashing methodology in case of collision for Hashtable, and chaining for Dictionary. But I'm still not sure exactly how it works, as I am not a math genius. :-\ Can anybody explain in more detail how it works?

Many Thanks, Kave

Houman
  • If the same hash code is generated, the Equals function is run on the objects to determine equality. So don't forget to override that function as well. – Magnus Feb 07 '11 at 15:28
  • I just wanted to thank everyone who contributed. I had an interview and they asked for HashSet lol. In one go I gave him all pro/contras of hashes as we discussed and he was impressed. Passed the interview. So it must be right. ;) – Houman Feb 17 '11 at 11:04

2 Answers


1) In general, yes, a Dictionary<TKey, TValue> or HashSet<T> has constant-time access. Locating an item in an unsorted List<T> or array must be done linearly. Sorted collections let you do binary searches, giving O(log n) access time.
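As a rough sketch of the difference (the data here is made up, and the timings are asymptotic, not measured):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The same data in two shapes.
List<int> list = Enumerable.Range(0, 1000).ToList();
Dictionary<int, int> dict = list.ToDictionary(n => n, n => n * n);

// List<T>.Contains walks the elements one by one: O(n) on average.
bool inList = list.Contains(999);

// Dictionary<TKey, TValue>.ContainsKey hashes the key straight to a
// bucket: O(1) on average.
bool inDict = dict.ContainsKey(999);

// On sorted data, BinarySearch gives O(log n).
int index = list.BinarySearch(999);

Console.WriteLine($"{inList} {inDict} {index}"); // True True 999
```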

2) If you override GetHashCode in .NET, you should also override the Equals method. In .NET Dictionary and HashSet, you can't insert items that are equal. Hash collisions are unavoidable in the general case (unless you have computed a perfect hash). There are several ways to resolve collisions.
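A minimal sketch of such a key class (PersonKey and its two ID fields are hypothetical, just to show the Equals/GetHashCode pairing):

```csharp
using System;
using System.Collections.Generic;

var dict = new Dictionary<PersonKey, string>();
dict[new PersonKey(1, 42)] = "Alice";
// A second, equal key (same ID fields) addresses the same entry, so this
// overwrites rather than inserting a duplicate:
dict[new PersonKey(1, 42)] = "Bob";
Console.WriteLine(dict.Count);                  // 1
Console.WriteLine(dict[new PersonKey(1, 42)]);  // Bob

// Hypothetical key class: identity is defined by two ID fields.
class PersonKey
{
    public int DepartmentId { get; }
    public int EmployeeId { get; }

    public PersonKey(int departmentId, int employeeId)
    {
        DepartmentId = departmentId;
        EmployeeId = employeeId;
    }

    // Equals and GetHashCode must agree: equal objects => equal hash codes.
    public override bool Equals(object obj) =>
        obj is PersonKey other &&
        DepartmentId == other.DepartmentId &&
        EmployeeId == other.EmployeeId;

    public override int GetHashCode() =>
        unchecked((DepartmentId * 397) ^ EmployeeId); // combine both fields
}
```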

3) For more information about collision resolution, see http://en.wikipedia.org/wiki/Hash_table.
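To give a concrete picture of one of those strategies, separate chaining, here is a toy sketch (this is an illustration, not how Dictionary<TKey, TValue> is actually implemented internally):

```csharp
using System;
using System.Collections.Generic;

var table = new ChainedHashTable<string, int>(size: 4); // tiny on purpose
table.Add("one", 1);
table.Add("two", 2);
table.Add("three", 3); // with only 4 buckets, some chains hold 2+ entries
Console.WriteLine(table.TryGetValue("two", out int v) && v == 2); // True

// Toy hash table using separate chaining: each bucket holds a list of the
// entries whose hash codes map to that slot; lookup scans only that list.
class ChainedHashTable<TKey, TValue>
{
    private readonly List<KeyValuePair<TKey, TValue>>[] _buckets;

    public ChainedHashTable(int size) =>
        _buckets = new List<KeyValuePair<TKey, TValue>>[size];

    private int BucketIndex(TKey key) =>
        (key.GetHashCode() & int.MaxValue) % _buckets.Length;

    public void Add(TKey key, TValue value)
    {
        int i = BucketIndex(key);
        if (_buckets[i] == null)
            _buckets[i] = new List<KeyValuePair<TKey, TValue>>();
        _buckets[i].Add(new KeyValuePair<TKey, TValue>(key, value));
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        int i = BucketIndex(key);
        if (_buckets[i] != null)
            foreach (var pair in _buckets[i])
                if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                {
                    value = pair.Value;
                    return true;
                }
        value = default;
        return false;
    }
}
```

Even when two keys land in the same bucket, Equals distinguishes them within the chain, which is why the two overrides have to be consistent.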

Jim Mischel
  • Particularly, in .NET collisions are resolved by having a linked list attached to each bucket. – Andrey Feb 07 '11 at 15:39
  • Many thanks for your answer. Further down I have written a comment on Steven's answer, which could very well also be asked of you. :) Since you mentioned the perfect hash: would I achieve it by using a 100% unique primary key from the DB? And does hash collision handling fall under the developer's responsibility, or is it taken care of automatically anyway? – Houman Feb 07 '11 at 15:43
  • Imagine that you have 1,000 unique keys in your database and your hash table can hold any 100 of those keys. The hash code that you create will be mapped by the hash table into one of those 100 slots. So even if your hash codes are unique, you can have collisions in the hash table. Minimally perfect hashing only works when there is a one-to-one mapping of hash codes to slots in the hash table. It's the developer's responsibility to define a hash function that gives a reasonably uniform distribution, but collision resolution is the responsibility of the hash table implementation. – Jim Mischel Feb 07 '11 at 15:58

A hash table is a general data structure, not something specific to .NET, so more background can be found by searching for information on hash tables in general.

1) A default search in a list is linear (all elements may need to be traversed). Perfect hashing (no collisions) allows for constant-time lookups even in the worst case. More collisions result in slower lookups.

2) Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys. Therefore, most hash table implementations have some collision resolution strategy to handle such events. .NET's Hashtable implementation seems to use double hashing.
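To sketch what double hashing looks like (the formula below follows the MSDN description of Hashtable's probing; ProbeSequence is a made-up helper for illustration, not a real API):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Double hashing: if bucket (h1 % size) is occupied, probe
// (h1 + i * h2) % size for i = 1, 2, ..., where the step
// h2 = 1 + (h1 % (size - 1)) is never zero. With a prime table size,
// the sequence visits every bucket exactly once.
static IEnumerable<int> ProbeSequence(int hashCode, int size)
{
    int h1 = hashCode & int.MaxValue;   // strip the sign bit
    int h2 = 1 + (h1 % (size - 1));     // secondary hash: the step size
    for (int i = 0; i < size; i++)
        yield return (int)(((long)h1 + (long)i * h2) % size); // avoid overflow
}

var probes = ProbeSequence(123, 11).ToList();
Console.WriteLine(string.Join(" ", probes)); // 2 6 10 3 7 0 4 8 1 5 9
```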

3) This is something you shouldn't have to worry about, as long as you provide proper hash codes. If you're interested, read the wiki article on hash tables, which explains several techniques.

UPDATE: Hashtable and Dictionary differ in how they handle collisions. Apparently Hashtable is considered obsolete, and Dictionary or HashSet is preferred.

As Jim Mischel mentions, you should override Equals as well as GetHashCode. Inserting items which are equal isn't possible, but items with the same hash code are handled by the collection type you choose.
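To see that in action, here is a contrived key (BadKey is made up) whose hash code is deliberately constant, so every instance collides; the dictionary still tells the keys apart via Equals, it just degrades toward a linear scan within the shared bucket:

```csharp
using System;
using System.Collections.Generic;

var dict = new Dictionary<BadKey, int>
{
    [new BadKey("a")] = 1,
    [new BadKey("b")] = 2, // same hash code, different key: still stored
};
Console.WriteLine(dict.Count);            // 2
Console.WriteLine(dict[new BadKey("b")]); // 2

// Deliberately bad key: every instance returns the same hash code,
// forcing all entries into one bucket. Correct, but slow at scale.
class BadKey
{
    public string Name { get; }
    public BadKey(string name) { Name = name; }
    public override bool Equals(object obj) => obj is BadKey b && b.Name == Name;
    public override int GetHashCode() => 42; // constant: everything collides
}
```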

Steven Jeuris
  • Many thanks for your answer. Realistically, if I base my GetHashCode() on a primary key field retrieved from the DB, wouldn't I bring the chances of a collision to zero? But in case the hash could be duplicated after all, isn't .NET taking care of rehashing/double hashing the values in case of a collision automatically? In the interview it sounded as if it were my responsibility to do something about it myself. :) Maybe they just wanted to hear about the double hashing being used internally, which I didn't mention. – Houman Feb 07 '11 at 15:38
  • If the DB pk type is int and the dictionary only contains objects of that type, then yes (you would just return the pk field from the GetHashCode function). But most of the time you need a good hashing function; see http://stackoverflow.com/questions/263400/what-is-the-best-algorithm-for-an-overridden-system-object-gethashcode/3880895#3880895 – Magnus Feb 07 '11 at 15:54
  • I agree with @Magnus. As for taking care of collisions, it is just something to be aware of so you understand why a unique as possible hashing function is important. – Steven Jeuris Feb 07 '11 at 15:58
  • @Magnus: But that only reduces the chance of a collision to zero if there is a one-to-one mapping of DB primary keys to hash table slots. If the key space is larger than the size of the hash table, then there will be collisions. – Jim Mischel Feb 07 '11 at 23:34
  • When the actual load factor reaches the specified load factor of the Dictionary, the number of buckets is automatically increased to the smallest prime number larger than twice the current number of buckets. Using a pk from the DB (that is an int) would result in an optimal hashing function. – Magnus Feb 08 '11 at 10:50