How to form String keys to get the most uniform hash code distribution

Question

I want to store a large number of objects in a HashMap. The key to identify each object is a String which is always made up of 3 parts/substrings, for simplicity I name them A, B and C. A has high variability, B average variability and C low variability. There are multiple ways for combining the parts:

key = A + "_" + B + "_" + C;
key = A + "_" + C + "_" + B;
key = B + "_" + A + "_" + C;
...

Primarily, I would like to know how the key should be built from the substrings that have different variability/randomness in order to get the most uniform hash code distribution. Should the most random bits come first, or at the end, or...?

Secondly, I'd like to know how the length of the key influences the time to get the object from the HashMap. For example, if I double the key length does the object retrieval take twice the time? Or does the calculation of the hash code only take a fraction of that time because the process of getting the object from the HashMap's buckets takes much longer?

Maybe some useful reading http://stackoverflow.com/questions/9364134/what-hashing-function-does-java-use-to-implement-hashtable-class — CollinD, Jan 06 '17 at 22:49
The cost/benefit here is so wildly lopsided that you should *just use whatever's most human-readable for the key*. — chrylis -cautiouslyoptimistic-, Jan 06 '17 at 23:14
It's impossible to predict the hash code distribution of arbitrary strings. Any order will do. — Andreas, Jan 06 '17 at 23:19
@Andreas true if you make no assumptions about the distribution of the sub-strings. However, given the fact that xpages-noob knows the variability of the sub-string distributions, he can use that knowledge to make a better hash function. See my answer below as an example. — Austin, Jan 06 '17 at 23:20
@AustinD You can't override the hash function of `String`. The keys are Strings, not some other object. — Andreas, Jan 06 '17 at 23:30

Austin · Answer 1 · 2017-01-06T23:24:49.803

1

Bottom line: You should use the standard hashCode method provided by the String class... but NOT because the order doesn't matter.

(In fact, if you had said that C had highest variability and A had lowest, then the performance of the java.lang.String.hashCode would be horrible!)

Take Away: Given addition information about the Object's members, the order of hashing has a substantial affect on the distribution of keys.

Normally, without any domain-specific knowledge, it's best to opt for readability and the reliability of well-established libraries for things like this. However, since you have specific insight into the distribution of your substrings, you can make a more informed decision regarding your hashFunction.

To demonstrate, suppose part A can take on any character value, part B takes on only the first 15 characters, and part C takes on only the first 5 characters. and suppose you override the hashCode method in the following way:

@Override
public int hashCode(){
    final int constant = 37;
    final String partA = getPartA(myString);
    final String partB = getPartB(myString);
    final String partC = getPartC(myString);
    int total = 17;
    total= total * constant + partA;
    total= total * constant + partB;
    total= total * constant + partC;

    return total;

}

We would expect a near uniform random distribution of strings from this method. However, if we were to reverse the following lines:

    total= total * constant + partC; //formerly part A
    total= total * constant + partB;
    total= total * constant + partA; //formerly part C

we would only generate codes in the first half of the value range. Here's some experimental results tested on 15,000 random strings that meet my stated assumptions above.

HashCode distribution when computed as A then B then C:

HashCode distribution when computed as C then B then A:

edited Jan 06 '17 at 23:24

answered Jan 06 '17 at 23:18

Austin

8,018
2
31
37

Whether you do ABC or CBA, the *concatenated* string will have same length, and calculating hashCode will take exactly the same amount of time. No performance difference, and hashcode distribution is very likely similar. – Andreas Jan 06 '17 at 23:24
@Andreas But we're not concatenating the entire string here... we're treating each substring independently. Even if we were treating it as a single string... given the stated distributions, we would still get better distributions by using `String.hashCode()` than using `new StringBuffer(string).reverse().toString().hashCode()` – Austin Jan 06 '17 at 23:25
Such a detailed answer in such a short time. Thanks and respect. – xpages-noob Jan 06 '17 at 23:27
@AustinD But that is not the question. The question is whether order of string parts matters to quality of hashcode distribution for concatenated string. You're *assuming* that you can do without concatenation, and you cannot, because [OP needs a concatenated string](http://stackoverflow.com/questions/41515601/how-to-form-string-keys-to-get-the-most-uniform-hash-code-distribution?noredirect=1#comment70237630_41515687). Your assumed premise is wrong. Would have been a great answer otherwise, though. – Andreas Jan 06 '17 at 23:28
@Andreas yes, but the behavior still holds in the concatenated form... it was just easier to demonstrate in the non-concatenated form. – Austin Jan 06 '17 at 23:29
@AustinD: I love that you directly tested it and plotted the histograms. Do you by chance have a mathematical explanation why higher variability in the first characters leads to a more uniform distribution? Does that happen because the first characters are multiplied by large factors [31^(n-i)] while the last characters are only multiplied by small ones? – xpages-noob Jan 06 '17 at 23:59
@AustinD [Proof that you're wrong](http://stackoverflow.com/a/41516484/5221149). Hashcode distribution doesn't care whether strings vary at the beginning or at the end. – Andreas Jan 07 '17 at 00:26

score 1 · Answer 2 · answered Jan 06 '17 at 23:26

Are you making the key just for the sake of using it in a HashMap? If so, then you don't even have to make it. You can put your object directly in a HashMap, but you must override the methods hashCode() and equals().

The good news is -- your IDE (e.g. Eclipse) can generate suggested code for hashCode() and equals() for you. (In Eclipse, Source>Generate hashCode() and equals() ...). You can take its suggestion from there.

See my example code below.

I tend to think the computation is really fast. But if you have concerns about the speed, and if the three fields/parts/substrings are immutable, then you can compute the hashCode in the constructor, as I have done in my example code.

The speed of accessing elements from a hashmap depends on the load factor (i.e. how full is the hashmap). If the hashmap is lightly loaded (most buckets has zero or one elements in it), you get almost constant time O(1) for access. If the hashmap is heavily loaded (most buckets has many elements), then the performance would slow down significantly.

Example Code

package StringKeyForHashMap;

import java.util.HashMap;
import java.util.Map;

public class Thing {
    private final String    a;
    private final String    b;
    private final String    c;
    private final int       hashCode;


    public Thing(String a, String b, String c) {
        super();
        this.a = a;
        this.b = b;
        this.c = c;
        this.hashCode = computeHashCode();
    }


    @Override
    public int hashCode() {
        return this.hashCode;
    }

    private int computeHashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((a == null) ? 0 : a.hashCode());
        result = prime * result + ((b == null) ? 0 : b.hashCode());
        result = prime * result + ((c == null) ? 0 : c.hashCode());
        return result;
    }


    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        Thing other = (Thing) obj;
        if (a == null) {
            if (other.a != null)
                return false;
        } else if (!a.equals(other.a))
            return false;
        if (b == null) {
            if (other.b != null)
                return false;
        } else if (!b.equals(other.b))
            return false;
        if (c == null) {
            if (other.c != null)
                return false;
        } else if (!c.equals(other.c))
            return false;
        return true;
    }


    public static void main(String[] args) {
        /*
         * Below I assume that the value of interest is 
         * an integer
         */
        Map<Thing, Integer> map = new HashMap<>();  
        map.put(new Thing("AAA", "BBB", "CCC"), 0);
    }

}

Thanks for your answer. Since I have to exchange the keys with other non-Java systems I prefer to use a String key to identify the object. — xpages-noob, Jan 06 '17 at 23:35

score 1 · Answer 3 · answered Jan 07 '17 at 00:25

Whether a String has high variability at the beginning of the string vs at the end of the string doesn't matter.

To test this, the below code simulates the hash-table logic of Java 8's HashMap class. The methods tableSizeFor and hash were copied from the JDK source code.

The code will create 60 character strings that differ only by the first or last 7 characters. It will then build a hash-table with appropriate capacity and count the number of hash bucket collisions.

As can be seen in the output, the collision counts are the same (within statistical margins), regardless of leading or trailing variability of the strings being hashed.

Output

Count: 1000      Collisions: 384      By collision size: {1=240, 2=72}
Count: 1000      Collisions: 278      By collision size: {1=191, 2=30, 3=3, 4=3, 6=1}
Count: 100000    Collisions: 13876    By collision size: {1=12706, 2=579, 3=4}
Count: 100000    Collisions: 15742    By collision size: {1=12644, 2=1378, 3=110, 4=3}
Count: 10000000  Collisions: 2705759  By collision size: {1=1703714, 2=381705, 3=65050, 4=9417, 5=1038, 6=101, 7=3}
Count: 10000000  Collisions: 2626728  By collision size: {1=1698957, 2=365663, 3=56156, 4=6278, 5=535, 6=27, 7=4}

Test Code

public class Test {
    public static void main(String[] args) throws Exception {
        //
        test(1000, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_%07d");
        test(1000, "%07d_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
        test(100000, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_%07d");
        test(100000, "%07d_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
        test(10000000, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_%07d");
        test(10000000, "%07d_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
    }
    private static void test(int count, String format) {
        // Allocate hash-table
        final int initialCapacity = count * 4 / 3 + 1;
        final int tableSize = tableSizeFor(initialCapacity);
        int[] tab = new int[tableSize];

        // Build strings, calculate hash bucket, and increment bucket counter
        for (int i = 0; i < count; i++) {
            String key = String.format(format, i);
            int hash = hash(key);
            int bucket = (tableSize - 1) & hash;
            tab[bucket]++;
        }

        // Collect collision counts, i.e. counts > 1
        // E.g. a bucket count of 3 means 1 original value plus 2 collisions
        int total = 0;
        Map<Integer, AtomicInteger> collisions = new TreeMap<>();
        for (int i = 0; i < tableSize; i++)
            if (tab[i] > 1) {
                total += tab[i] - 1;
                collisions.computeIfAbsent(tab[i] - 1, c -> new AtomicInteger()).incrementAndGet();
            }

        // Print result
        System.out.printf("Count: %-8d  Collisions: %-7d  By collision size: %s%n", count, total, collisions);
    }
    static final int MAXIMUM_CAPACITY = 1 << 30;
    static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }
    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }
}

Thanks for your answer. Unfortunately, each answer here states something different. I will evaluate the different suggestions and code examples and report back when I know more. — xpages-noob, Jan 07 '17 at 01:15

score 0 · Answer 4 · answered Jan 06 '17 at 22:55

0

The order doesn't effect the distribution of the hashkey. All characters have the same "weight".

The longer the key, the more time it takes to calculate hash, BUT String reuse the hashCode once it's created, therefore if you reuse the same String, hashCode is generated only once.

Having said that, I would suggest you change your implementation:

Create immutable class thats accepts A,B,C in constructor and calculate the hash value in constructor.
Make hashCode return the hash value from constructor.
If possible, reuse the instances of the class, so you don't need to recalculate the hashcode each time the map is accessed.
Don't forget to override equals.

Even if you don't reuse the object, it's a better approach since it encapsulates the hash logic. But the real benefit comes if the object is reused.

answered Jan 06 '17 at 22:55

Tarlog

10,024
2
43
67

Thanks for your answer. If I understand it correctly you're suggesting to use instances of the new immutable class as keys for the HashMap. If so, this won't really work for the project I need this for because the keys have to be strings as they get exchanged with other non-Java systems. – xpages-noob Jan 06 '17 at 23:06
Why not? You receive strings from external system and then create and object: new MyMapKey(A,B,C) – Tarlog Jan 06 '17 at 23:57
I know, but for simplicity I'd prefer to identify the object by the same single key on both system. – xpages-noob Jan 07 '17 at 00:02

How to form String keys to get the most uniform hash code distribution

4 Answers4