67

From the book CLRS ("Introduction to Algorithms"), there are several hashing functions, such as mod, multiply, etc.

What hashing function does Java use to map the keys to slots?

I have seen there is a question here Hashing function used in Java Language. But it doesn't answer the question, and I think the marked answer for that question is wrong. It says that hashCode() let you do your own hashing function for Hashtable, but I think it is wrong.

The integer returned by hashCode() is the real key for Hashtble, then Hashtable uses a hashing function to hash the hashCode(). What this answer implies is that Java give you a chance to give Hashtable a hashing function, but no, it is wrong. hashCode() gives the real key, not the hashing function.

So what exactly the hashing function does Java use?

Community
  • 1
  • 1
Jackson Tale
  • 25,428
  • 34
  • 149
  • 271
  • http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7-b147/java/util/Hashtable.java#Hashtable – SLaks Feb 20 '12 at 15:59
  • @SLaks, so "int index = (hash & 0x7FFFFFFF) % tab.length;" is the real hashing function? – Jackson Tale Feb 20 '12 at 16:08
  • possible duplicate of [Understanding strange Java hash function](http://stackoverflow.com/questions/9335169/understanding-strange-java-hash-function) – Raedwald Jan 21 '14 at 23:34
  • You may find this Stackoverflow Question answer: [Helpful link][1] [1]: http://stackoverflow.com/questions/13825546/a-faster-hash-function – Jeff Apr 13 '14 at 07:48

6 Answers6

111

When a key is added to or requested from a HashMap in OpenJDK, the flow of execution is the following:

  1. The key is transformed into a 32-bit value using the developer-defined hashCode() method.
  2. The 32-bit value is then transformed by a second hash function (of which Andrew's answer contains the source code) into an offset inside the hash table. This second hash function is provided by the implementation of HashMap and cannot be overridden by the developer.
  3. The corresponding entry of the hash table contains a reference to a linked list or null, if the key does not yet exist in the hash table. If there are collisions (several keys with the same offset), the keys together with their values are simply collected in a singly linked list.

If the hash table size was chosen appropriately high, the number of collisions will be limited. Thus, a single lookup takes only constant time on average. This is called expected constant time. However, if an attacker has control over the keys inserted into a hash table and knowledge of the hash algorithm in use, he can provoke a lot of hash collisions and therefore force linear lookup time. This is why some hash table implementations have been changed recently to include a random element that makes it harder for an attacker to predict which keys will cause collisions.

Some ASCII art

key.hashCode()
     |
     | 32-bit value
     |                              hash table
     V                            +------------+    +----------------------+
HashMap.hash() --+                | reference  | -> | key1 | value1 | null |
                 |                |------------|    +----------------------+
                 | modulo size    | null       |
                 | = offset       |------------|    +---------------------+
                 +--------------> | reference  | -> | key2 | value2 | ref |
                                  |------------|    +---------------------+
                                  |    ....    |                       |
                                                      +----------------+
                                                      V
                                                    +----------------------+
                                                    | key3 | value3 | null |
                                                    +----------------------+
Niklas B.
  • 92,950
  • 18
  • 194
  • 224
  • 3
    Yeah, this is a very good explanation for Hashtable. Very clear. – Jackson Tale Feb 20 '12 at 17:55
  • 3
    Hi Niklas, how do you draw this ASCII diagram? It looks really nice! – janetsmith Dec 21 '12 at 18:21
  • 1
    @janetsmith: Using a good text editor (VIM or EMACS, I don't recall) that supports block-based editing. – Niklas B. Dec 22 '12 at 01:29
  • 2
    I thought you were using this http://www.jave.de/ , its a nice image-ascii editor – janetsmith Dec 24 '12 at 04:23
  • I don't get that you mention chaining, altough when I use a Java HashMap and store another item at a given key it just gets overridden? – stevebot Apr 20 '14 at 06:32
  • @stevebot The chaining is to resolve collisions, where multiple objects hash to the same slot but are *not* equal. Remember that hash functions can never be injective – Niklas B. Apr 20 '14 at 06:34
  • Quick question, I'm confused how to map the 32 bit value to the address of the value, hopefully that's not the RAM address. Given a large hashtable as you said, will the computer protect the whole space for the table? Is the space continuous? What the relationship between the value address and hash value? – Albert Chen Feb 28 '15 at 01:23
  • @Albert not sure what you mean by address. The hash value modulo the table size is used to index the hash table slot where the item is held. It's an array index, not an address – Niklas B. Feb 28 '15 at 01:41
  • @NiklasB. So am I understanding right? Actually the hash table is a huge array, the reason to allocate that for a large space is reducing hash collision. Even with few useful data, the hash table is still very large. – Albert Chen Feb 28 '15 at 14:29
  • @AlbertChen No, the hash table size is typically chose to be within a constant factor of the number of elements it contains (usually between 1 and 1.5) – Niklas B. Feb 28 '15 at 20:49
  • who will be the attacker,what the attacker's purpose.and in which way the attacker can attack the program. cause i don't know what the benefit can the attacker get – asdf Oct 08 '18 at 08:08
  • 1
    @asdf Here I was referring to denial-of-service attacks like https://www.ruby-lang.org/en/news/2011/12/28/denial-of-service-attack-was-found-for-rubys-hash-algorithm-cve-2011-4815/, https://medium.com/@bamieh/nodejs-constant-hashtables-seeds-vulnerability-f03bf70e3593 etc – Niklas B. Oct 08 '18 at 10:02
35

According to hashmap's source(java version < 8), every hashCode is hashed using the following method:

 /**
 * Applies a supplemental hash function to a given hashCode, which
 * defends against poor quality hash functions.  This is critical
 * because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

The reason every hashCode is hashed again is to further prevent a collision (see comments above)

HashMap also uses a method to determine the index of a hash code(java version < 8) (since length is always a power of 2, you can use & instead of %):

/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length-1);
}

The put method looks something like:

int hash = hash(key.hashCode());
int i = indexFor(hash, table.length);

The purpose of a hash code is to provide a unique integer representation for a given object. It makes sense, then, that Integer's hashCode method simply returns the value because each value would be unique to that Integer object.

Additional Ref:
HashMap for java8
HashMap for java11

dkb
  • 4,389
  • 4
  • 36
  • 54
Andrew Liu
  • 666
  • 4
  • 7
  • 15
    I don't think every hashCode is hashed again is just to further prevent collision. I think the hashing function is try to convert hashCode() value into the slot index for underlying array. – Jackson Tale Feb 20 '12 at 16:19
  • 1
    By the way, I think that this function has been randomized a bit in the meantime to prevent certain denial-of-service scenarios where an attacker could use the knowledge about the hash algorithm to provoke collisions. – Niklas B. Feb 20 '12 at 16:51
  • Hm, the above doesn't yet seem to be true for OpenJDK, nevermind. – Niklas B. Feb 20 '12 at 17:10
  • 1
    Quick question, could explain why people choose the hash function like that, and how can you guarantee there's few collision of this function. – Albert Chen Feb 28 '15 at 01:26
  • 1
    @Jackson Tale **hashCode() + hash() + indexFor()** together are try to map the an Object into a slot. BTW, look at HasnMap instead of HashTable. – hotpro May 01 '15 at 02:07
  • Does anyone know the name of this hashing algorithm and its properties? Is there is a 64 bit version? – QuantumKarl Apr 29 '16 at 12:32
4

Hashing in general is divided into two steps: a. HashCode b. Compressing

In step a. an integer corresponding to your key is generated. This can be modified by you in Java.

In step b. a compression technique is applied by Java to map the integer returned by step a. to a slot in the hashmap or hashtable. This compression technique cannot be changed.

3
/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower.  Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.)  So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

This is the latest hash function used by hashMap class in java

Avinash
  • 33
  • 5
0

I think there is some confusion about the concept here. A hash function maps a variable-size input to a fixed-size output (the hash value). In the case of Java objects the output is a 32-bit signed integer.

Java's Hashtable use the hash value as an index into an array where the actual object is stored, taking modulo arithmetic and collisions into account. However, this is not hashing.

The java.util.HashMap implementation performs some additional bit swapping on the hash value before indexing to protect against excessive collisions in some cases. It is called "additional hash", but I don't think that is a correct term.

forty-two
  • 12,204
  • 2
  • 26
  • 36
  • 1
    "However, this is not hashing." Yes it is. The 32-bit integer is used as input to another a hash function specific to that hash table (depending on the size of the vector). – Niklas B. Feb 20 '12 at 16:45
  • @forty-two so you mean the real hashing function is in hashCode(), and another function which decides the slot index of the array is not called hashing, but just an extra step for the index? – Jackson Tale Feb 20 '12 at 16:46
  • @Jackson: See my comment above. Actually two hash functions are involved, the second of which maps an arbitrary 32-bit value to the exact range of the slot indices. – Niklas B. Feb 20 '12 at 16:47
  • @NiklasB.Yeah I saw it and I agree with you. But many people just think hashCode() can decide everything for Hashtable, which is true, I think – Jackson Tale Feb 20 '12 at 17:16
  • @NiklasB. Can you please write an answer for this question and I will mark yours as correct one – Jackson Tale Feb 20 '12 at 17:17
  • @NiklasB. Thanks for the answer. I actually like that it's not random and my HashMap is deterministic. Perhaps letting the developer choose is a better idea. I'm surprised to find out that DoS attacks can happen by exploiting hash collisions; fascinating. – Neo M Hacker May 24 '19 at 00:47
-1

To put it in a very simple way the second hashing is nothing but finding the index number of the bucket array where the new key-value pair will be stored. This mapping is done to get the index number from the bigger int value of the hashcode of the key obj. Now if two unequal key objects have same hash code then collision will happen as they will be mapped to the same array index. In this case the second key along with it's value will be added to the linked list. Here the array index will point to the last node added.