I have 232 4-byte keys that I'm hashing; what's the chance of collision?
What if I have 264 8-byte keys (not really storing every key, but I want to know the worst case)?
I have 232 4-byte keys that I'm hashing; what's the chance of collision?
What if I have 264 8-byte keys (not really storing every key, but I want to know the worst case)?
Per the wikipedia page on the Birthday Problem, a good first order approximation can be found with 1-e^(-(n^2)/d)
. Graphing this for your values gives this graph (logarithmic horizontal axis, I've zoomed in on where the probability starts to spike). Note that this is only an approximation, and should be considered conservatively (ie, the real probability may be somewhat higher, but it should be in the right ballpark).
What are you doing with the hash codes? If you're using them to work out whether two pieces of data are the same, an MD5 hash is pretty good, though only if you are working with data that is not being created by malicious entities. (Cryptographic purposes need better hash algorithms precisely in order to deal with the "malicious attacker" problem.)
If you're using them for building a map (i.e., you're building a hash table) it's usually better to use a cheap hash and come up with a way to mitigate the costs of collision (e.g., by hanging a linked list off the hash table and resizing/rebuilding when the average weight gets too large).