24

In the Javadoc for Object.hashCode() it states

As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java™ programming language.)

It's a common miconception this has something to do with the memory address but it doesn't as that can change without notice and the hashCode() does not and must not change for an object.

@Neet Provided a link to a good answer https://stackoverflow.com/a/565416/57695 but I am looking for more details.


Here is an example to illustrate my concern

Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
theUnsafe.setAccessible(true);
Unsafe unsafe = (Unsafe) theUnsafe.get(null);

for (int t = 0; t < 10; t++) {
    System.gc();
    Object[] objects = new Object[10];
    for (int i = 0; i < objects.length; i++)
        objects[i] = new Object();

    for (int i = 0; i < objects.length; i++) {
        if (i > 0) System.out.print(", ");
        int location = unsafe.getInt(objects, Unsafe.ARRAY_OBJECT_BASE_OFFSET + Unsafe.ARRAY_OBJECT_INDEX_SCALE * i);
        System.out.printf("%08x: hc= %08x", location, objects[i].hashCode());
    }
    System.out.println();
}

prints

eac00038: hc= 4f47e0ba, eac00048: hc= 2342d884, eac00058: hc= 7994d431, eac00068: hc= 19f71b53, eac00078: hc= 2e22f376, eac00088: hc= 789ddfa3, eac00098: hc= 44c58432, eac000a8: hc= 036a11e4, eac000b8: hc= 28bc917c, eac000c8: hc= 73f378c8
eac00038: hc= 30813486, eac00048: hc= 729f624a, eac00058: hc= 3dee2310, eac00068: hc= 5d400f33, eac00078: hc= 18a60d19, eac00088: hc= 3da5f0f3, eac00098: hc= 596e0123, eac000a8: hc= 450cceb3, eac000b8: hc= 4bd66d2f, eac000c8: hc= 6a9a4f8e
eac00038: hc= 711dc088, eac00048: hc= 584b5abc, eac00058: hc= 3b3219ed, eac00068: hc= 564434f7, eac00078: hc= 17f17060, eac00088: hc= 6c08bae7, eac00098: hc= 3126cb1a, eac000a8: hc= 69e0312b, eac000b8: hc= 7dbc345a, eac000c8: hc= 4f114133
eac00038: hc= 50c8c3b8, eac00048: hc= 2ca98e77, eac00058: hc= 2fc83d89, eac00068: hc= 034005e1, eac00078: hc= 6041f871, eac00088: hc= 0b1df416, eac00098: hc= 5b83d60d, eac000a8: hc= 2c5a1e6b, eac000b8: hc= 5083198c, eac000c8: hc= 4f025f9f
eac00038: hc= 00c5eb8a, eac00048: hc= 41eab16b, eac00058: hc= 1726099c, eac00068: hc= 4240eca3, eac00078: hc= 346fe350, eac00088: hc= 1db4b415, eac00098: hc= 429addef, eac000a8: hc= 45609812, eac000b8: hc= 489fe953, eac000c8: hc= 7a8f6d64
eac00038: hc= 7e628e42, eac00048: hc= 7869cfe0, eac00058: hc= 6aceb8e2, eac00068: hc= 29cc3436, eac00078: hc= 1d77daaa, eac00088: hc= 27b4de03, eac00098: hc= 535bab52, eac000a8: hc= 274cbf3f, eac000b8: hc= 1f9fd541, eac000c8: hc= 3669ae9f
eac00038: hc= 772a3766, eac00048: hc= 749b46a8, eac00058: hc= 7e3bfb66, eac00068: hc= 13f62649, eac00078: hc= 054b8cdc, eac00088: hc= 230cc23b, eac00098: hc= 1aa3c177, eac000a8: hc= 74f2794a, eac000b8: hc= 5af92541, eac000c8: hc= 1afcfd10
eac00038: hc= 396e1dd8, eac00048: hc= 6c696d5c, eac00058: hc= 7d8aea9e, eac00068: hc= 2b316b76, eac00078: hc= 39862621, eac00088: hc= 16315e08, eac00098: hc= 03146a9a, eac000a8: hc= 3162a60a, eac000b8: hc= 4382f3da, eac000c8: hc= 4a578fd6
eac00038: hc= 225765b0, eac00048: hc= 17d5176d, eac00058: hc= 26f50154, eac00068: hc= 1f2a45c7, eac00078: hc= 104b1bcd, eac00088: hc= 330e3816, eac00098: hc= 6a844689, eac000a8: hc= 12330301, eac000b8: hc= 530a3ffc, eac000c8: hc= 45eee3fb
eac00038: hc= 3f9432e0, eac00048: hc= 1a9830bc, eac00058: hc= 7da79447, eac00068: hc= 04f801c4, eac00078: hc= 363bed68, eac00088: hc= 185f62a9, eac00098: hc= 1e4651bf, eac000a8: hc= 1aa0e220, eac000b8: hc= 385db088, eac000c8: hc= 0ef0cda1

As a side note; If you look at this code

if (value == 0) value = 0xBAD ;

It appears that 0xBAD is twice as likely as normal as any hashCode as 0 is mapped to this value. If you run this long enough you see

long count = 0, countBAD = 0;
while (true) {
    for (int i = 0; i < 200000000; i++) {
        int hc = new Object().hashCode();
        if (hc == 0xBAD)
            countBAD++;
        count++;
    }
    System.out.println("0xBAD ratio is " + (double) (countBAD << 32) / count + " times expected.");
}

prints

0xBAD ratio is 2.0183116992481205 times expected.
Community
  • 1
  • 1
Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • 1
    Has been asked before, see here: http://stackoverflow.com/questions/557574/what-is-native-implementation-in-java/565416#565416 – Neet Dec 13 '12 at 12:52
  • 1
    @Neet Since the OP was one of the people answering that other question, I would tend to think he is asking something different. – assylias Dec 13 '12 at 12:52
  • I changed the link to the correct answer which was not answered by the OP. It explains how `hashCode()` works. – Neet Dec 13 '12 at 12:54
  • @neet The problem I had with that question is the accepted answer is IMHO incorrect. My answer says what it is not, but not what it is. The most voted for answer appears to be better. – Peter Lawrey Dec 13 '12 at 12:54
  • Internal address is the address in memory. A java developer would be unfamiliar with this, since pointers do not exist in Java. – Michael Ozeryansky Dec 13 '12 at 12:55
  • @MichaelOzeryansky My whole point is it is not the memory address. nor does it change when the address does. – Peter Lawrey Dec 13 '12 at 12:56
  • 1
    @Peter okay, then nevermind :-). @Michael every object you pass around gets passed as a `pointer` ... otherwise the `NullPointerException` won't make sense^^ – Neet Dec 13 '12 at 12:56
  • @neet or NPE was given it's name in the early days when naming wasn't as consistent as it is today. ;) – Peter Lawrey Dec 13 '12 at 12:57
  • @neet, I think there are two answers in the questions you pointed to which are basically right. It would be good to have more details as the memory addresses are reused in the eden space on every GC. This would cause a lot of collisions in hashCodes if this were the only thing used. – Peter Lawrey Dec 13 '12 at 12:59
  • 3
    The docs says "typically implemented". All the internals of the JVM don't need to make sense to the developer. It will work is all they tell you. So most likely when the object is instantiated it will have a unique identifier stored based from the address, which stays with the object wherever. – Michael Ozeryansky Dec 13 '12 at 13:00
  • *The docs says "typically implemented"*, no single impl. I know of does that. – bestsss Jan 10 '13 at 19:45
  • 1
    variants of this questions have been asked multiple times (usually regarding hashCode): http://stackoverflow.com/questions/4930781/how-do-hashcode-and-identityhashcode-work-at-the-back-end or this: http://stackoverflow.com/questions/10917731/does-hashcode-return-the-memory-address – bestsss Jan 10 '13 at 19:48
  • @bestsss They are good answers, but I was looking for more low level details e.g NPE's answer. – Peter Lawrey Jan 11 '13 at 08:36
  • Peter, in one of my answer there is a link to the impl., here is another link to explanation of the object header bits (i.e. biased locking/hashing) by Cliff Click: http://www.azulsystems.com/blog/cliff/2010-01-09-biased-locking – bestsss Jan 11 '13 at 10:12

4 Answers4

23

This is clearly implementation-specific.

Below I include the Object.hashCode() implementation used in OpenJDK 7.

The function supports six different calculation methods, only two of which take any notice of the object's address (the "address" being the C++ oop cast to intptr_t). One of the two methods uses the address as-is, whereas the other does some bit twiddling and then mashes the result with an infrequently-updated random number.

Of the remaining methods, one returns a constant (presumably for testing), one returns sequential numbers, and the rest are based on pseudo-random sequences.

It would appear that the method can be chosen at runtime, and the default seems to be method 0, which is os::random(). The latter is a linear congruential generator, with an alleged race condition thrown in. :-) The race condition is acceptable because at worst it would result in two objects sharing the same hash code; this does not break any invariants.

The computation is performed the first time a hash code is required. To maintain consistency, the result is then stored in the object's header and is returned on subsequent calls to hashCode(). The caching is done outside this function.

In summary, the notion that Object.hashCode() is based on the object's address is largely a historic artefact that has been obsoleted by the properties of modern garbage collectors.

// hotspot/src/share/vm/runtime/synchronizer.hpp

// hashCode() generation :
//
// Possibilities:
// * MD5Digest of {obj,stwRandom}
// * CRC32 of {obj,stwRandom} or any linear-feedback shift register function.
// * A DES- or AES-style SBox[] mechanism
// * One of the Phi-based schemes, such as:
//   2654435761 = 2^32 * Phi (golden ratio)
//   HashCodeValue = ((uintptr_t(obj) >> 3) * 2654435761) ^ GVars.stwRandom ;
// * A variation of Marsaglia's shift-xor RNG scheme.
// * (obj ^ stwRandom) is appealing, but can result
//   in undesirable regularity in the hashCode values of adjacent objects
//   (objects allocated back-to-back, in particular).  This could potentially
//   result in hashtable collisions and reduced hashtable efficiency.
//   There are simple ways to "diffuse" the middle address bits over the
//   generated hashCode values:
//

static inline intptr_t get_next_hash(Thread * Self, oop obj) {
  intptr_t value = 0 ;
  if (hashCode == 0) {
     // This form uses an unguarded global Park-Miller RNG,
     // so it's possible for two threads to race and generate the same RNG.
     // On MP system we'll have lots of RW access to a global, so the
     // mechanism induces lots of coherency traffic.
     value = os::random() ;
  } else
  if (hashCode == 1) {
     // This variation has the property of being stable (idempotent)
     // between STW operations.  This can be useful in some of the 1-0
     // synchronization schemes.
     intptr_t addrBits = intptr_t(obj) >> 3 ;
     value = addrBits ^ (addrBits >> 5) ^ GVars.stwRandom ;
  } else
  if (hashCode == 2) {
     value = 1 ;            // for sensitivity testing
  } else
  if (hashCode == 3) {
     value = ++GVars.hcSequence ;
  } else
  if (hashCode == 4) {
     value = intptr_t(obj) ;
  } else {
     // Marsaglia's xor-shift scheme with thread-specific state
     // This is probably the best overall implementation -- we'll
     // likely make this the default in future releases.
     unsigned t = Self->_hashStateX ;
     t ^= (t << 11) ;
     Self->_hashStateX = Self->_hashStateY ;
     Self->_hashStateY = Self->_hashStateZ ;
     Self->_hashStateZ = Self->_hashStateW ;
     unsigned v = Self->_hashStateW ;
     v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)) ;
     Self->_hashStateW = v ;
     value = v ;
  }

  value &= markOopDesc::hash_mask;
  if (value == 0) value = 0xBAD ;
  assert (value != markOopDesc::no_hash, "invariant") ;
  TEVENT (hashCode: GENERATE) ;
  return value;
}
NPE
  • 486,780
  • 108
  • 951
  • 1,012
  • 1
    I was browsing the source but could not find it - what file is that in? – assylias Dec 13 '12 at 13:08
  • 3
    `hotspot/src/share/vm/runtime/synchronizer.hpp` – NPE Dec 13 '12 at 13:08
  • 1
    Thanks - for the record, link to openjdk7: http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/9b0ca45cd756/src/share/vm/runtime/synchronizer.cpp - it does not seem to have changed much... – assylias Dec 13 '12 at 13:10
  • Any idea which one it uses, or how you might change it? – Peter Lawrey Dec 13 '12 at 13:10
  • 2
    @PeterLawrey: It looks like `hashCode` is a runtime flags (not yet sure how to set it). The default *seems* to be 0. – NPE Dec 13 '12 at 13:15
  • Note: default [looks to be 5](http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/tip/src/share/vm/runtime/globals.hpp#l1148) nowadays. – eis Aug 28 '17 at 14:51
  • and you can use `-XX:hashCode=2` to set it. – eis Aug 28 '17 at 14:55
5

It typically is the memory address of the object. However, the first time the hashcode method is called on an object, the integer is stored in the header of that object so that the next invocation will return the same value (as you say, compacting garbage collection can change the address). To my knowledge that is how it is implemented in the Oracle JVM.

EDIT: digging down into the JVM source code, this is what shows up (synchronizer.cpp):

// hashCode() generation :
//
// Possibilities:
// * MD5Digest of {obj,stwRandom}
// * CRC32 of {obj,stwRandom} or any linear-feedback shift register function.
// * A DES- or AES-style SBox[] mechanism
// * One of the Phi-based schemes, such as:
//   2654435761 = 2^32 * Phi (golden ratio)
//   HashCodeValue = ((uintptr_t(obj) >> 3) * 2654435761) ^ GVars.stwRandom ;
// * A variation of Marsaglia's shift-xor RNG scheme.
// * (obj ^ stwRandom) is appealing, but can result
//   in undesirable regularity in the hashCode values of adjacent objects
//   (objects allocated back-to-back, in particular).  This could potentially
//   result in hashtable collisions and reduced hashtable efficiency.
//   There are simple ways to "diffuse" the middle address bits over the
//   generated hashCode values:
//

static inline intptr_t get_next_hash(Thread * Self, oop obj) {
  intptr_t value = 0 ;
  if (hashCode == 0) {
     // This form uses an unguarded global Park-Miller RNG,
     // so it's possible for two threads to race and generate the same RNG.
     // On MP system we'll have lots of RW access to a global, so the
     // mechanism induces lots of coherency traffic.
     value = os::random() ;
  } else
  if (hashCode == 1) {
     // This variation has the property of being stable (idempotent)
     // between STW operations.  This can be useful in some of the 1-0
     // synchronization schemes.
     intptr_t addrBits = intptr_t(obj) >> 3 ;
     value = addrBits ^ (addrBits >> 5) ^ GVars.stwRandom ;
  } else
  if (hashCode == 2) {
     value = 1 ;            // for sensitivity testing
  } else
  if (hashCode == 3) {
     value = ++GVars.hcSequence ;
  } else
  if (hashCode == 4) {
     value = intptr_t(obj) ;
  } else {
     // Marsaglia's xor-shift scheme with thread-specific state
     // This is probably the best overall implementation -- we'll
     // likely make this the default in future releases.
     unsigned t = Self->_hashStateX ;
     t ^= (t << 11) ;
     Self->_hashStateX = Self->_hashStateY ;
     Self->_hashStateY = Self->_hashStateZ ;
     Self->_hashStateZ = Self->_hashStateW ;
     unsigned v = Self->_hashStateW ;
     v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)) ;
     Self->_hashStateW = v ;
     value = v ;
  }

  value &= markOopDesc::hash_mask;
  if (value == 0) value = 0xBAD ;
  assert (value != markOopDesc::no_hash, "invariant") ;
  TEVENT (hashCode: GENERATE) ;
  return value;
}

So 6 different ways to do it in the Oracle JVM, one of them is equivalent to what I said... The value is stored in the header of the object by the method calling get_next_hash (that one is called FastHashCode and is called from the native version of Object.hashCode().

Mathias Schwarz
  • 7,099
  • 23
  • 28
  • 1
    Objects are first created in the eden space, and everytime a minor/full GC is run the eden space is reused with the same addresses. If the hashCode used the memory address they would repeat a lot. – Peter Lawrey Dec 13 '12 at 13:07
  • Whether it is the address depends on which of the 6 hashcode algorithms is chosen. Really, all that is needed is a way to generate integers that do not overlap too much and do it fast... – Mathias Schwarz Dec 13 '12 at 13:26
  • I don't really know why so many people are voting me down for my answer. I will leave this discussion now. I have better things to do than fighting over who found some random source code first... – Mathias Schwarz Dec 13 '12 at 13:29
  • @MathiasSchwarz, It is possible that they decided to upvote NPE answer and remove the vote thay have given to you. It is not about who found example first but how does it added a value to it. You just posted to code NPE pointed some thing out. – Damian Leszczyński - Vash Dec 13 '12 at 13:45
2

IMHO, although implementation dependent, Object references in JVM are never actual memory addresses, but internal JVM references that point to actual addresses. These internal references ARE generated initially based on memory address, but they remain associated with the object until it is discarded.

I say this because Java HotSpot GCs are copying collectors of some form or the other, and they work by traversing live objects and copying them from one heap to the other, destroying the old heap subsequently. So when GC occurs, the JVM does not have to update the references in all objects, but just change the actual memory address mapped to the internal reference.

Subhash
  • 3,121
  • 1
  • 19
  • 25
0

The Javadoc's suggestion for Object.hashCode() that it is implemented by deriving the hashcode from the memory address is simply outdated.

Probably nobody bothered (or noticed) that this implementation path is no longer possible whith a copying garbage collector (Since it would change the hashcode when the Object is copied to another memory location).

It made a lot of sense to implement hashcode that way before there was a copying garbage collector, because it would save heap space. A non-copying GC (CMS) may still implement hashcode that way today.

Durandal
  • 19,919
  • 4
  • 36
  • 70