
I'm mapping x,y values onto a Cartesian plane with a HashMap. What would be an effective HashCode for very small x, very large y values?

Currently I am using:

 public int hashCode() {
    return ((y * 31) ^ x);
 }

Typical x,y values would be (with many collisions on x):
  [4, 1000001] [9, 1000000] [5, 999996] [6, 999995] [4, 999997] 
  [6, 999997] [6, 1000003] [10, 999994] [8, 999997] [10, 999997] 
  [5, 999999] [4, 999998] [5, 1000003] [2, 1000005] [3, 1000004] 
  [6, 1000000] [3, 1000005]

I am inserting each x,y pair as the key of a HashMap with the .put method, to avoid any duplicate x,y pairs. I'm not sure that is the most effective solution either.

Berzerker
  • Could you guarantee that the values don't exceed 2^63-1? I'd watch out for very large y values like that. – Makoto Nov 10 '12 at 01:57

3 Answers


Sometimes the best way to know is to run some brute-force tests on your ranges. Ultimately, though, you can always write a hash function and go back and fix it later if you're getting poor performance. Premature optimization is evil. Still, it's easy to test hashing.

I ran this program and got 0 collisions:

import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

public class Testing {

    public static void main(String[] args) {
        int minX = 0;
        int minY = 100000;
        int maxX = 20;
        int maxY = 2000000;

        Map<Integer, Integer> hashToCounts = new HashMap<Integer, Integer>();
        for (int x = minX; x < maxX; x++) {
            for (int y = minY; y < maxY; y++) {
                int hash = hash(x, y);
                Integer count = hashToCounts.get(hash);
                if (count == null)
                    count = 0;
                hashToCounts.put(hash, ++count);
            }
        }

        int totalCollisions = 0;
        for (Entry<Integer, Integer> hashCountEntry : hashToCounts.entrySet())
            if (hashCountEntry.getValue() > 1)
                totalCollisions += hashCountEntry.getValue() - 1;

        System.out.println("Total collisions: " + totalCollisions);
    }

    private static int hash(int x, int y) {
        return 7 + y * 31 + x * 23;
    }
}

And the output:

Total collisions: 0

Note that my function was 7 + y * 31 + x * 23.

Of course, don't take my word for it. Mess with the ranges to tweak it to your data set and try calculating it yourself.
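One way to make that experiment easier to repeat is to pass the hash function in as a parameter. The following is only a minimal sketch, not the exact program above: the class name `CollisionCounter` and the method `countCollisions` are made up for illustration. It counts a collision whenever a hash value has already been produced by an earlier pair, which gives the same total as the counting loop above.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntBinaryOperator;

// Hypothetical helper, not part of the original program.
final class CollisionCounter {

    // Counts pairs in [minX, maxX) x [minY, maxY) whose hash value was
    // already produced by an earlier pair.
    static long countCollisions(IntBinaryOperator hash,
                                int minX, int maxX, int minY, int maxY) {
        Set<Integer> seen = new HashSet<>();
        long collisions = 0;
        for (int x = minX; x < maxX; x++) {
            for (int y = minY; y < maxY; y++) {
                // Set.add returns false when the value is already present.
                if (!seen.add(hash.applyAsInt(x, y))) {
                    collisions++;
                }
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        // A narrow band around the y values from the question (assumed range).
        int minX = 0, maxX = 20, minY = 999_990, maxY = 1_000_010;

        System.out.println("7 + y*31 + x*23: "
                + countCollisions((x, y) -> 7 + y * 31 + x * 23, minX, maxX, minY, maxY));
        System.out.println("(y*31) ^ x:      "
                + countCollisions((x, y) -> (y * 31) ^ x, minX, maxX, minY, maxY));
        System.out.println("x * y:           "
                + countCollisions((x, y) -> x * y, minX, maxX, minY, maxY));
    }
}
```

Keeping the range small like this makes the test cheap enough to run on a local machine; widen it once a candidate looks promising.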

Using your (y * 31) ^ x gave me:

Total collisions: 475000

And using just x * y:

Total collisions: 20439039

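Be warned that this program can use a pretty good chunk of memory and computing power: with the ranges above it hashes 38 million x,y pairs and keeps the counts in a HashMap of boxed Integers. I ran it on a pretty powerful server. I have no idea how it'll run on a local machine.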

Some good rules to follow for hashing are:

  • Mix up your operators. By mixing your operators, you can cause the results to vary more. Using simply x * y in this test, I had a very large number of collisions.
  • Use prime numbers for multiplication. Prime numbers have interesting binary properties that cause multiplication to be more volatile.
  • Avoid using shift operators (unless you really know what you're doing). They insert lots of zeroes or ones into the binary of the number, decreasing volatility of other operations and potentially even shrinking your possible number of outputs.
Brian

Seems like x * y would work well, especially if the result will fit in an int.

You could use a HashSet: internally it is backed by a HashMap in which only the keys matter (every key maps to the same dummy value). It would make the intention to avoid duplicates more evident.
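As a rough sketch of that suggestion, assuming the key is a small immutable class (the name `Point` and its fields are assumptions, since the question does not show the key class), a `HashSet` can reject duplicates directly. The `hashCode` here reuses the expression from the question; any of the candidates discussed in this thread could be swapped in.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical key class; the question does not show its real one.
final class Point {
    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // equals and hashCode must agree: equal points need equal hash codes.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Point)) return false;
        Point p = (Point) other;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        return (y * 31) ^ x; // the expression from the question
    }

    public static void main(String[] args) {
        Set<Point> points = new HashSet<>();
        System.out.println(points.add(new Point(4, 1000001))); // true, newly added
        System.out.println(points.add(new Point(4, 1000001))); // false, duplicate
        System.out.println(points.add(new Point(9, 1000000))); // true, newly added
        System.out.println(points.size());                     // 2
    }
}
```

The return value of `add` tells you whether the pair was new, so no separate lookup is needed before inserting.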

Skip Head

It's hard to predict. The HashMap does some rehashing using the hash() method shown below, then takes the bottom X bits. So, in an ideal world, ignoring the hash() method which stirs things up, your least significant bits should be well distributed.

static int hash(int h) {
  // This function ensures that hashCodes that differ only by
  // constant multiples at each bit position have a bounded
  // number of collisions (approximately 8 at default load factor).
  h ^= (h >>> 20) ^ (h >>> 12);
  return h ^ (h >>> 7) ^ (h >>> 4);
}
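To make the "bottom bits" step concrete, here is a small sketch that applies the `hash()` function quoted above and then masks the result down to a bucket index. The class `BucketIndexDemo` and the helper `bucketIndex` are invented for illustration, and the masking assumes a power-of-two table size; other JDK versions may stir the bits differently.

```java
// Sketch of how a hash code becomes a bucket index, assuming a
// power-of-two table size.
final class BucketIndexDemo {

    // Copied from the snippet quoted above.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Hypothetical helper: keep only the low bits of the stirred hash.
    // tableSize is assumed to be a power of two.
    static int bucketIndex(int hashCode, int tableSize) {
        return hash(hashCode) & (tableSize - 1);
    }

    public static void main(String[] args) {
        int tableSize = 16; // HashMap's default initial capacity
        // Two tiny hash codes plus the question's formula for [4, 1000001].
        int[] samples = {1, 16, (1000001 * 31) ^ 4};
        for (int h : samples) {
            System.out.println(h + " -> bucket " + bucketIndex(h, tableSize));
        }
    }
}
```

In this sketch the hash codes 1 and 16 are different yet both land in bucket 1 of a 16-slot table, which is why distinct hash codes alone do not guarantee distinct buckets.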

I usually start with something really simple, e.g. x^y (or x shifted by something ^ y or vice versa), and create the HashMap, and see if there are too many collisions.

user949300