10

I want to hash a set of integers such that the order of the integers don't have an effect on the computed hash value. i.e. H([32224,12232,564423]) == H([564423,32224,12232]).

The number of unique sets will be in the range of a few millions. Speed is very important, but I need to know the upperbound on collisions with a chosen approach.

Wikipedia has a good section on hashing vectors, but I don't understand the math behind it to confidently implement them in code. I would appreciate if someone can explain the math involved with some code. Ideally, I would like the final hash to be of 32 bits. If it's of any use - I'll be implementing this in Java.

Update: I'm specifically looking to avoid sorting of the integers in the set, because of performance reasons (operating on lots of such sets).

jeffreyveon
  • 13,400
  • 18
  • 79
  • 129

5 Answers5

7

A simple approach is to xor or add the hashes of the individual integers together. xor and add are commutative, so this satisfies order independence.

Thus:

int hc = 0;
for(int i = 0; i < n; i++) {
   hc += a[i];
}
return hc;

or

int hc = 0;
for(int i = 0; i < n; i++) {
   hc ^= a[i];
}
return hc;

because the hash code of an int is its value anyway.

In fact, this is exactly whatHashSet<Integer>.hashCode (uses add) will do. If your integers are already boxed, or you can handle boxing them, that's a built-in solution.

jason
  • 236,483
  • 35
  • 423
  • 525
  • I considered XOR-ing. The issue is that I'm pretty sure, there will be lots of collisions. E.g. {1, 1, 2} and {2} will hash to the same value under XOR. So I was wondering if there is a better way to do this. – jeffreyveon Aug 02 '13 at 16:28
  • @jeffreyveon: Try `add`. – jason Aug 02 '13 at 16:28
  • 1
    xor is worse than add bc if the numbers are within a certain range, but big sets, add will spread them further. – Antti Haapala -- Слава Україні Aug 02 '13 at 16:31
  • My guess is that `add` will be better, but I will try this out to see how many collisions I get. – jeffreyveon Aug 02 '13 at 16:32
  • @jeffreyveon: That'd be my suspicion too. – jason Aug 02 '13 at 16:32
  • @jeffreyveon you just cited an example with {1,1,2}. Will your sets have duplicates? If so, then the xor method will effectively nullify duplicates. H({X, X, Y}) == H({Y}). Also, you should note that both methods (add and xor) will ignore `0` since there is no affect of adding or xor'ing `0`. You could add a base to each value when computing the hash to mitigate these problems. – Trenin Aug 02 '13 at 16:51
2

You can put all the Integers in Java HashSet and use its hashCode.

On the other hand, java.util.Set does specify the following in documents:

Returns the hash code value for this set. The hash code of a set is defined to be the sum of the hash codes of the elements in the set, where the hash code of a null element is defined to be zero. This ensures that s1.equals(s2) implies that s1.hashCode()==s2.hashCode() for any two sets s1 and s2, as required by the general contract of Object.hashCode().

And Integer.hashCode() is then

a hash code value for this object, equal to the primitive int value represented by this Integer object.

Thus the hashCode for set of integers i1, i2, ... i_n in Java standard library is i1 + i2 + ... + i_n.

In case that the numbers are rather small, you can also multiply each element by a some suitably sized prime number. Knuth used 2654435761 which is too big for an java int, but you can take its 2-complement, -1640531527. Thus take C = -1640531527, and then your code is C*i1 + C*i2 + ... C*i_n.

private static final int C = -1640531527;

public static int calculateHash(int[] set) {
    int code = 0;
    for (int e: set) {
        code += C * e;
    }

    return code;
}

However there is 1 obvious flaw in the thinking. To make any use of the hashCode, you need to be able to prove that 2 sets are indeed equal, thus in any case the easiest way to prove is to sort the elements. Of course if there are substantially less than millions of sets then there are not that many collisions either.

2

Assuming you need speed without the overhead of the *Set classes, then you could write H as follows:

/**
 * Hashes a set of integers.
 * 
 * @param list to hash
 * @return hash code
 */
public static int H(int list[]) {
    // XOR all the integers together.
    int hashcode = 0;
    for (int val : list) {
        hashcode ^= val;
    }
    return hashcode;
}

It is the same regardless of the order, and it is relatively efficient.

For example:

public static void main(String[] args) {
    System.out.println(Integer.toHexString(H(new int[]{0xabcd,0x1234,0x1111})));
    System.out.println(Integer.toHexString(H(new int[]{0x1234,0x1111,0xabcd})));
}

Displays:

a8e8
a8e8

This could be generalized to more than just ints by doing the following:

/**
 * Hashes a set of objects.
 * 
 * @param list to hash
 * @return hash code
 */
public static int H(Object list[]) {
    // XOR all the hashes together.
    int hashcode = 0;
    for (Object val : list) {
        hashcode ^= val.hashCode();
    }
    return hashcode;
}

The main program would then have to use arrays of Integer instead of the primitive int.

Adding the numbers should be almost as quick, and might give you a better distribution over the 32 bit range. If the elements of the set are already uniformly distributed over the range, then xor might be better.

However, with both methods, you can manufacture collisions easily with integers. For example, with the adding method;

{1000, 1001, 1002}
{0, 1, 3002}

Both of these arrays have the same H().

With the XOR method;

{0x1010, 0x0101}
{0x1111, 0x0000}

Both of these have the same H().

Similarly, the 0 element is problematic since lists will have the same hash with or without it. You can mitigate this by adding a constant value at each iteration. For example:

            ...
            hashcode += val.hashCode() + CONSTANT;
            ...

Or by including the number of elements as the initial hash code:

            ...
            // XOR all the hashes together.
            int hashcode = list.length;
            ...
Trenin
  • 2,041
  • 1
  • 14
  • 20
  • The first one (without `hashCode()`) is a very bad idea, since lots of confusable pairs XOR to the same value. For example, `10003 ^ 10004` = 7 = `100003 ^ 100004`, so if you add or remove a zero, which is a very common mistake, the first hash function above won't tell you and it should tell you. – Chai T. Rex Sep 04 '18 at 19:19
2

I'd prefer summation rather then xoring, because 1) sum is used in Set's hashCode() implementation, 2) sum as approach to array hashing is recommended in Effective Java 3) it's less collision-prone. I suggest you to look at openjdk's AbstractSet implementation: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7-b147/java/util/AbstractSet.java?av=f

120    public int hashCode() {
121        int h = 0;
122        Iterator<E> i = iterator();
123        while (i.hasNext()) {
124            E obj = i.next();
125            if (obj != null)
126                h += obj.hashCode();
127        }
128        return h;
129    }

I'd also recommend making h long, and return (int) ((h & 0xffffffffL) & h >>> 32))

tkroman
  • 4,811
  • 1
  • 26
  • 46
  • Sum is just a side character in the typical `hashCode` implementation: the main point is that the interim result is first *multiplied* by a prime, and only then the next number is added. – Marko Topolnik Aug 02 '13 at 17:21
  • @MarkoTopolnik, that's not my implementation, but OpenJDK's. Anyway, `int`'s hashcode is equal to that int, so I see no reason to multiply it for just being present in array. – tkroman Aug 02 '13 at 18:04
  • Your point under 2) involves multiplication by a prime. Sets are a different thing due to their unorderedness. – Marko Topolnik Aug 02 '13 at 18:07
  • I'm afraid I didn't get your point. Could you please elaborate on that? – tkroman Aug 02 '13 at 18:20
  • In Effective Java, Josh Bloch recommends calculating the `hashCode` of an array as if all array elements were fields of an object. And that procedure involves repeated multiplication of the interim result with a prime number. – Marko Topolnik Aug 02 '13 at 18:24
  • I totally agree with you, that's true for all valid Objects, but as long as Java's int's hashCode is equal to that int's value, `int`s itself are exceptional in that sense, I recon. – tkroman Aug 02 '13 at 18:31
  • @Marko Topolnik: That standard way mentioned in Effective Java isn't order-independent. – jason Aug 02 '13 at 18:33
  • @cdshines Hm. Are you familiar with the standard rulse for `hashCode` generation? Your last comment doesn't make sense minding those rules. – Marko Topolnik Aug 02 '13 at 18:35
  • @Marko Topolnik: I meant to address comment to cdshines, not you. Terribly sorry. – jason Aug 02 '13 at 18:41
  • 1
    @MarkoTopolnik, what's wrong? `Integer x = 1234567; x.hashCode() == x`. If we apply hashing rules recursively, we are going to treat ints as Objects, but there is no need to do that because rules for hashing ints are already clearly defined. Anyway, what is wrong in my comment? – tkroman Aug 03 '13 at 10:30
  • I am talking about *array hashing* which you mention in your point 2, not about the hashing of a single integer. – Marko Topolnik Aug 03 '13 at 10:36
  • I understand that. I thought (because you mentioned recursive application of rules) that you suggest implementing other integer hashing rules. But what about array hashing? Is summation method unacceptable? – tkroman Aug 03 '13 at 10:38
  • 2
    Array hashing, as recommended in Effective Java, does not involve just summation, but the multiplication of interim results with a prime. Therefore it is not a good starting point for this requirement. I am not commenting on the viability of pure summation. – Marko Topolnik Aug 03 '13 at 10:45
0

This is by no means trivial programming, but you could take inspiration from DES algorithm's S-boxes: by this you can achieve a good dispersion function which maps similar integers to very dissimilar ones. Then XOR-ing these dissimilar integers should no longer present a threat due to collisions.

Marko Topolnik
  • 195,646
  • 29
  • 319
  • 436
  • The OP writes about millions of sets, so when using an `int`, there'll be (by the birthday paradox) many collisions (even for a perfect hash). For dispersing, something like [Hashing.smear](https://github.com/google/guava/blob/master/guava/src/com/google/common/collect/Hashing.java#L50) or the old (no more used) [HashMap.hash](http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/tip/src/share/classes/java/util/HashMap.java#l264) should do. – maaartinus Jun 07 '17 at 13:55