NSString hash collision doesn't mess up NSDictionary

Question

Run this code in the debugger and stop afterward (I'm in the iOS 7 simulator in Xcode 5.0.1).

NSString *nsStr = @"/Users/123456789012/Library/Application Support/iPhone Simulator/7.0/Applications/94F31827-6DAD-4BD5-AC91-B215176265E1/Documents/libraries/shared/17517_abPub/OEBPS/indexes/Index.sqlite";
NSString *nsStr2 = @"/Users/123456789012/Library/Application Support/iPhone Simulator/7.0/Applications/94F31827-6DAD-4BD5-AC91-B215176265E1/Documents/libraries/shared/16583_abPub/OEBPS/indexes/Index.sqlite";
NSUInteger form1 = [nsStr hash];
NSUInteger form2 = [nsStr2 hash];

NSMutableDictionary *dict = [[[NSMutableDictionary alloc]init]autorelease];
[dict setObject:@"foo" forKey:nsStr];

id foobar = [dict objectForKey:nsStr2];

Note that form1 and form2 are the same. We have a hash collision. Also note that foobar is nil. The hash collision does not faze NSDictionary. Why is this? Does anyone know what Apple is doing to survive hash collision in their dictionary / what some good strategies are for this?

EDIT: For reference, here is more detail on NSString hashing. Apparently the method only looks at the first, middle, and last 32 characters; anything else in the string doesn't matter.

Why did you expect that it would break? Even simple dictionary implementations need to deal with hash collisions. — Wain, Oct 09 '13 at 20:16
Right. My experience with dictionary implementations used better hashes like MD5, so I guess I was expecting the NSString hash to be better. — Tom Hamming, Oct 09 '13 at 20:24
Is your expectation that the hash of the key object is what's used as the actual key? — jscs, Oct 09 '13 at 20:24
@JoshCaswell I was unthinkingly assuming that, yes. But it makes sense that you can solve hash collisions easily and thus enable optimization of the hash algorithm for speed over uniqueness. — Tom Hamming, Oct 09 '13 at 20:25
If you are really curious, you can inspect the implementation: https://www.opensource.apple.com/source/CF/CF-476.14/CFDictionary.c. — Martin R, Oct 09 '13 at 20:25
A hash code is an integer, it may have for example 4 bytes. A string long 1000 characters is 2KB if it contains all unichars. How is supposed the hash to map all the possible values of the strings? A hash code cannot be unique for all strings, it is used to distribute (that is for example, to choose in which list to put the string inside the dictionary, if it is implemented with an array of lists), not for unique identification. — Ramy Al Zuhouri, Oct 09 '13 at 20:35
I'm not getting why this is downvoted. This is a perfectly valid question. — Guillaume Algis, Oct 09 '13 at 21:38

score 7 · Accepted Answer · answered Oct 09 '13 at 20:16

A hash value is not unique. Many different values can have the same hash value. The implementation of the dictionary knows this. The use of the hash is simply a way to optimize the lookup. But it's the actual key value that matters, not the hash value.

As long as any two objects that return YES to isEqual: also have the same hash, things are fine.

Think of a dictionary as a series of buckets. The hash determines which bucket the value is in. Once you know the bucket, you still need to look for the exact key.

NSString hash collision doesn't mess up NSDictionary

1 Answers1

Linked