The book Numerical Recipes offers a method for computing 64-bit hash codes in order to reduce the number of collisions.
The algorithm is shown at http://www.javamex.com/tutorials/collections/strong_hash_code_implementation_2.shtml and is copied here for reference:
private static final long HSTART = 0xBB40E64DA205B064L;  // initial hash value, as given on the linked page
private static final long HMULT = 7664345821815920749L;  // multiplier, as given on the linked page
private static final long[] byteTable = createLookupTable();

// Builds a 256-entry table of pseudo-random 64-bit values, one per byte value.
private static long[] createLookupTable() {
    long[] byteTable = new long[256];
    long h = 0x544B2FBACAAF1684L;
    for (int i = 0; i < 256; i++) {
        for (int j = 0; j < 31; j++) {
            h = (h >>> 7) ^ h;
            h = (h << 11) ^ h;
            h = (h >>> 10) ^ h;
        }
        byteTable[i] = h;
    }
    return byteTable;
}
// Hashes each char 8 bits at a time: multiply, then XOR with a table entry.
public static long hash(CharSequence cs) {
    long h = HSTART;
    final long hmult = HMULT;
    final long[] ht = byteTable;
    final int len = cs.length();
    for (int i = 0; i < len; i++) {
        char ch = cs.charAt(i);
        h = (h * hmult) ^ ht[ch & 0xff];
        h = (h * hmult) ^ ht[(ch >>> 8) & 0xff];
    }
    return h;
}
My questions:
1) Is there a formula to estimate the probability of a collision, taking the so-called Birthday Paradox into account?
2) Can you estimate the probability of a collision (i.e., two keys that hash to the same value), say with 1,000 keys and with 10,000 keys?
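To make question 2 concrete, here is the back-of-envelope I would try myself, assuming the hash behaves like an ideal uniform 64-bit function and that the standard birthday approximation P(collision) ~ 1 - exp(-n(n-1)/2^65) applies (the class name and the choice of n values are mine):

// Birthday-paradox estimate, assuming an ideal uniform 64-bit hash.
// P(at least one collision among n keys) ~ 1 - exp(-n(n-1)/2^65)
public class CollisionEstimate {
    public static void main(String[] args) {
        final double SPACE = Math.pow(2, 64); // number of distinct 64-bit values
        for (int n : new int[] {1_000, 10_000, 100_000}) {
            double x = (double) n * (n - 1) / (2.0 * SPACE);
            double p = -Math.expm1(-x); // expm1 keeps precision when x is tiny
            System.out.printf("n = %,7d -> P(collision) ~ %.2e%n", n, p);
        }
    }
}

If I have set this up right, it prints probabilities on the order of 10^-14 for 1,000 keys and 10^-12 for 10,000 keys, but I would like someone to confirm the formula is the right one.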
EDIT: rephrased/corrected question 3
3) Is it safe to assume that a collision among a reasonable number of keys (say, fewer than 10,000) is so improbable that, if two hash codes are equal, we can declare the keys equal without any further checking? e.g.

// Key is a placeholder for my actual key type; hash64() returns the 64-bit hash above.
static boolean equals(Key key1, Key key2) {
    // probability of a collision is so low that we skip comparing the keys themselves
    return key1.hash64() == key2.hash64();
}
This is not for security, but execution speed is imperative, so avoiding a full comparison of the keys would save time. If the probability is sufficiently low, say less than 1 in 1 billion for 100,000 keys, it will probably be acceptable.
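(My own back-of-envelope, assuming the approximation above holds: for n = 100,000, n(n-1)/2^65 is roughly 10^10 / 3.7*10^19 ~ 2.7*10^-10, i.e. about 1 in 3.7 billion, which would be under my threshold. Corrections welcome if that reasoning is off.)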
TIA!