58

I am looking at the implementation of HashMap in Java and am stuck at one point.
How is the indexFor function calculated?

static int indexFor(int h, int length) {
   return h & (length-1);
}

Thanks

jmiserez
  • 2,991
  • 1
  • 23
  • 34
gnreddy
  • 2,393
  • 4
  • 22
  • 18

5 Answers5

133

The hash itself is calculated by the hashCode() method of the object you're trying to store.

What you see here is calculating the "bucket" to store the object based on the hash h. Ideally, to evade collisions, you would have the same number of buckets as is the maximum achievable value of h - but that could be too memory demanding. Therefore, you usually have a lower number of buckets with a danger of collisions.

If h is, say, 1000, but you only have 512 buckets in your underlying array, you need to know where to put the object. Usually, a mod operation on h would be enough, but that's too slow. Given the internal property of HashMap that the underlying array always has number of buckets equal to 2^n, the Sun's engineers could use the idea of h & (length-1), it does a bitwise AND with a number consisting of all 1's, practically reading only the n lowest bits of the hash (which is the same as doing h mod 2^n, only much faster).

Example:

     hash h: 11 1110 1000  -- (1000 in decimal)
   length l: 10 0000 0000  -- ( 512 in decimal)
      (l-1): 01 1111 1111  -- ( 511 in decimal - it will always be all ONEs)
h AND (l-1): 01 1110 1000  -- ( 488 in decimal which is a result of 1000 mod 512)
James McMahon
  • 48,506
  • 64
  • 207
  • 283
Petr Janeček
  • 37,768
  • 12
  • 121
  • 145
  • 5
    Does it make sense now, or should I elaborate more on the internals? – Petr Janeček Jun 04 '12 at 10:16
  • 8
    _Very_ well explained. I'm impressed. – Louis Wasserman Jun 04 '12 at 11:29
  • 2
    Does this mean that a hash bucket could contain keys with different `hashCodes` if the lowest 9 or so bits agree, but the higher bits are different? – Paul Boddington Apr 15 '15 at 11:17
  • 1
    @pbabcdefp Yes, that's correct. That's one of the main reasons why a [good](http://stackoverflow.com/questions/2738886/what-is-a-best-practice-of-writing-hash-function-in-java) [hash](http://stackoverflow.com/questions/113511/hash-code-implementation) [function](http://stackoverflow.com/questions/2624192/good-hash-function-for-strings) multiplies it's result by a prime number - it tries to achieve very different hashes even when the values differ by a little. – Petr Janeček Apr 15 '15 at 11:25
  • 1
    @Slanec Thank you. This is one of the clearest answers I've seen on SO. (+1). – Paul Boddington Apr 15 '15 at 11:28
  • 1
    The best explanation of the bucket calculation on the web. Thanks a lot! – Max Gabderakhmanov Apr 18 '19 at 18:23
  • @PetrJaneček would mind explaining this? If L != 2^n? then L-1 will not have all the bits set to 1 right? Say L = 511 and not 512, then? – KJ Sudarshan Apr 24 '21 at 04:29
  • 1
    @KJ Sudarshan Exactly correct. And in that case the trick with `&` wouldn't work, but you could still use a normal modulo, `%`. The binary trickery is simply a performance trick. – Petr Janeček Apr 24 '21 at 08:24
  • I've always thought that each bucket held items with the same hashcode. There are a lot of articles out there that say this. But what you are saying here is that the hashcode gets a furthur map into buckets with mixed hashcodes. This makes more sense storagewise. – Humpity Nov 04 '21 at 10:42
42

It's not calculating the hash, it's calculating the bucket.

The expression h & (length-1) does a bit-wise AND on h using length-1, which is like a bit-mask, to return only the low-order bits of h, thereby making for a super-fast variant of h % length.

Bohemian
  • 412,405
  • 93
  • 575
  • 722
  • 3
    Can you please explain this calculation here ? – gnreddy Jun 04 '12 at 09:48
  • 3
    Does this assume that `length` is a power of 2? – LarsH Sep 14 '12 at 18:00
  • 2
    @LarsH Well, it would be far better if it was a power of 2, then you'd get a clean chop off of the high-order bits. As it happens, the implementation of HashMap starts with size 16 and does indeed multiply by two when resizing. It would still work if not a power of two, but you would want as many bits "on" as possible for `length -1` to balance the spread between buckets – Bohemian Sep 15 '12 at 02:07
  • So does it mean that in HashMap there is always 16 buckets (initial capacity)? Because if it is `h % length`, then no more buckets can be created. – LexSav Sep 14 '20 at 13:45
  • @lex I don’t understand your question, but the capacity (the quantity of buckets), which is stored in `length`, doubles if the buckets are “too full”. The algorithm relies on `length` always being a power of 2. `16` was chosen as using the Goldielocks principle (not to big and not too small) – Bohemian Sep 14 '20 at 17:01
  • @Bohemian When i try this `indexFor(18,7)` it return 2 but `18 % 7` return 4, why would you say it's `a super-fast variant of h % length` – uncle bob Aug 14 '21 at 03:22
  • 1
    @unclebob as per my previous comment, **the algorithm relies on `length` being a power of 2** (which is always the case for HashMap). 7 is not a power of 2. – Bohemian Aug 14 '21 at 03:27
4

The above answer is very good but I want to explain more why Java can use indexFor for create index

Example, I have a HashMap like this (this test is on Java7, I see Java8 change HashMap a lot but I think this logic still very good)

// Default length of "budget" (table.length) after create is 16 (HashMap#DEFAULT_INITIAL_CAPACITY)
HashMap<String, Integer> hashMap = new HashMap<>();
hashMap.put("A",1); // hash("A")=69, indexFor(hash,table.length)=69&(16-1) = 5
hashMap.put("B",2); // hash("B")=70, indexFor(hash,table.length)=70&(16-1) = 6
hashMap.put("P",3); // hash("P")=85, indexFor(hash,table.length)=85&(16-1) = 5
hashMap.put("A",4); // hash("A")=69, indexFor(hash,table.length)=69&(16-1) = 5
hashMap.put("r", 4);// hash("r")=117, indexFor(hash,table.length)=117&(16-1) = 5

You can see the index of entry with key "A" and object with key "P" and object with key "r" have same index (= 5). And here is the debug result after I execute code above

enter image description here

Table in the image is here

public class HashMap<K, V> extends AbstractMap<K, V> implements Map<K, V>, Cloneable, Serializable {
    transient HashMap.Entry<K, V>[] table;
    ...
}

=> I see
If index are different, new entry will add to table
If index is same and hash is same, new value will update
If index is same and hash is different, new entry will point to old entry (like a LinkedList). Then you know why Map.Entry have field next

static class Entry<K, V> implements java.util.Map.Entry<K, V> {
        ...
        HashMap.Entry<K, V> next;
}

You can verify it again by read the code in HashMap.

As now, you can think that HashMap will never need to change the size (16) because indexFor() always return value <= 15 but it not correct.
If you look at HashMap code

 if (this.size >= this.threshold ...) {
      this.resize(2 * this.table.length);

HashMap will resize table (double table length) when size >= threadhold

What is threadhold? threadhold is calculated below

static final int DEFAULT_INITIAL_CAPACITY = 16;
static final float DEFAULT_LOAD_FACTOR = 0.75F;
...
this.threshold = (int)Math.min((float)capacity * this.loadFactor, 1.07374182E9F); // if capacity(table.length) = 16 => threadhold = 12

What is the size? size is calculated below.
Of course, size here is not table.length .
Any time you put new entry to HashMap and HashMap need to create new entry (note that HashMap don't create new entry when the key is same, it just override new value for existed entry) then size++

void createEntry(int hash, K key, V value, int bucketIndex) {
    ...
    ++this.size;
}

Hope it help

Linh
  • 57,942
  • 23
  • 262
  • 279
2

It is calculating the bucket of the hash map where the entry (key-value pair) will be stored. The bucket id is hashvalue/buckets length.

A hash map consists of buckets; objects will be placed in these buckets based on the bucket id.

Any number of objects can actually fall into the same bucket based on their hash code / buckets length value. This is called a 'collision'.

If many objects fall into the same bucket, while searching their equals() method will be called to disambiguate.

The number of collisions is indirectly proportional to the bucket's length.

user207421
  • 305,947
  • 44
  • 307
  • 483
Ramesh PVK
  • 15,200
  • 2
  • 46
  • 50
-3

bucket_index = (i.hashCode() && 0x7FFFFFFFF) % hashmap_size does the trick