Sometimes the best way to know is to just run some brute force tests on your ranges. Ultimately though, you can always write a hash function and go back and fix it later if your getting poor performance. Premature optimization is evil. Still, it's easy to test hashing.
I ran this program and got 0 collisions:
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
public class Testing {
public static void main(String[] args) {
int minX = 0;
int minY = 100000;
int maxX = 20;
int maxY = 2000000;
Map<Integer, Integer> hashToCounts = new HashMap<Integer, Integer>();
for (int x = minX; x < maxX; x++) {
for (int y = minY; y < maxY; y++) {
int hash = hash(x, y);
Integer count = hashToCounts.get(hash);
if (count == null)
count = 0;
hashToCounts.put(hash, ++count);
}
}
int totalCollisions = 0;
for (Entry<Integer, Integer> hashCountEntry : hashToCounts.entrySet())
if (hashCountEntry.getValue() > 1)
totalCollisions += hashCountEntry.getValue() - 1;
System.out.println("Total collisions: " + totalCollisions);
}
private static int hash(int x, int y) {
return 7 + y * 31 + x * 23;
}
}
And the output:
Total collisions: 0
Note that my function was 7 + y * 31 + x * 23
.
Of course, don't take my word for it. Mess with the ranges to tweak it to your data set and try calculating it yourself.
Using your (y * 31) ^ x
gave me:
Total collisions: 475000
And using just x * y
:
Total collisions: 20439039
Be warned that this program can use a pretty good chunk of memory and computing power. I ran it on a pretty powerful server. I have no idea how it'll run on a local machine.
Some good rules to follow for hashing are:
- Mix up your operators. By mixing your operators, you can cause the results to vary more. Using simply
x * y
in this test, I had a very large number of collisions.
- Use prime numbers for multiplication. Prime numbers have interesting binary properties that cause multiplication to be more volatile.
- Avoid using shift operators (unless you really know what you're doing). They insert lots of zeroes or ones into the binary of the number, decreasing volatility of other operations and potentially even shrinking your possible number of outputs.