8

The default implementation of hashCode() on HotSpot returns a random value and stores it in the object header. This doesn't seem to have changed in Java 8 where the hash value is calculated by a call to os::random():

static inline intptr_t get_next_hash(Thread * Self, oop obj) {
  intptr_t value = 0 ;
  if (hashCode == 0) {
     // This form uses an unguarded global Park-Miller RNG,
     // so it's possible for two threads to race and generate the same RNG.
     // On MP system we'll have lots of RW access to a global, so the
     // mechanism induces lots of coherency traffic.
     value = os::random() ;
  } else
...

I'm wondering why hashCode() constantly returns the same value, also after shutting down the JVM which I tried by executing the simple test below, restarting my machine and then running main() again.

public class SimpleTest {
  public static void main(String[] args) {
    Object obj = new Object();
    // This calls toString() which calls hashCode() which calls os::random()
    System.out.println(obj);
  }
}

How can the output be the same everytime if hashCode() is actually os::random()?


java -version gives

java version "1.8.0_40"
Java(TM) SE Runtime Environment (build 1.8.0_40-b25)
Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)

Note:

Should someone ask themselves what System.out.println(obj);, which calls obj.toString() if the object is non-null and produces something like java.lang.Object@659e0bfd, has to do with hashCode(): The part after @ is the object's hash code in hexadecimal (and is unrelated with the object's location in memory, contrary to what the documentation suggests, which has led to misunderstandings).

Matthias Braun
  • 32,039
  • 22
  • 142
  • 171
  • 1
    `os::random()` would be deterministic. It must always be started with the same initial seed. – user207421 Mar 26 '15 at 22:34
  • Ah, I assumed that the seed is the current time in milliseconds (like it is the case with Java's `Random`) – Matthias Braun Mar 26 '15 at 22:38
  • The seed of the method is initially `1`. Assuming it is not changed, the first value will always be `(16807 * 1) mod (2 ** 31-1)`. Then the next seed is based off of that and so on. I don't know enough of the JVM internals to actually determine if this is the entire reason, or if there is more to it. – Obicere Mar 26 '15 at 22:58
  • Why is it not random? "Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin" – Steve Kuo Mar 26 '15 at 23:11
  • With 1.8 i get the same. with 1.7 seems to be one of two, changes every 5 seconds or so 4d905742 or 6cd9c6e2 – tgkprog Mar 27 '15 at 10:30

3 Answers3

6

Deterministic behavior makes code easier to debug because it can be replicated. So implementations tend to choose that where possible. Imagine how hard it would be to replicate some unit test that failed due to a mishandling a hash collision (say, after a hash was reduced in length) if the hashes were different every time.

David Schwartz
  • 179,497
  • 17
  • 214
  • 278
  • 2
    That doesn't seem like a great reason; if you're writing tests that depend on the behaviour of `.hashCode()`, you should be testing on objects that have explicitly defined a `.hashCode()` method. Relying on `Object.hashCode()` to remain stable just invites problems, particularly with upgrading JVM versions. If anything, this is an argument for explicitly making `Object.hashCode()` volatile between runs. – dimo414 Mar 26 '15 at 23:06
  • Where did I say anyone could or should ever rely on it remaining stable? If you make it volatile though, you might have a test that fails but then you can't replicate that failure, making debugging much more difficult. – David Schwartz Mar 26 '15 at 23:08
  • Sure, but you're suggesting the reason it's stable is to aid in testing. I just don't see that as being a very good reason for this behaviour; people who know better won't take advantage of it, and people who don't will be confused why their tests pass now but their code breaks later. – dimo414 Mar 26 '15 at 23:11
  • But `Object.hashCode()` expressly doesn't document that users can expect the method to return the same value over multiple runs. Therefore even if `Object.hashCode()` *happens* to be deterministic now, callers can't rely on that behavior. Therefore, it would be a mistake to write tests that assume this. Anyone testing hash collisions or the like should implement their own `.hashCode()` method, and not rely on what `Object.hashCode()` may or may not do. – dimo414 Mar 26 '15 at 23:19
  • @dimo414 Repeatable behavior is simply better than code that fails in irreproducible ways. Yes, someone who goes to the trouble to implement it themselves can get exactly what they want, whatever that behavior is. The idea is to give the person who doesn't bother the best outcome -- and repeatability is better. If someone finds a test case that makes code do something wrong, having it reproduce that problem when you run the case again is better than being unable to reproduce the problem. – David Schwartz Mar 26 '15 at 23:23
  • I'm not saying you're suggesting anything, but I'm disagreeing with your statement that "it's superior because it makes the behavior repeatable". The behavior OP has come across is only repeatable in certain circumstances; the worst kind of repeatable :) – dimo414 Mar 26 '15 at 23:24
  • I just realized what I think our disconnect was; I interpreted your answer as a benefit for *Java programmers*, not JVM programmers. Someone programming in Java cannot trust this behavior to be deterministic, but someone developing the JVM itself can, of course. – dimo414 Mar 27 '15 at 04:23
5

To answer your question, we first have to ask the secondary question, "Why is os::random() seeded with a fixed seed?"

As @DavidSchwartz suggested, having a "random" number generator with a fixed seed is very useful, as it gives you arbitrary but deterministic behavior. The JVM developers can call os::random() and still know the behavior of the JVM isn't dependent on any external factors. Among other benefits, this means JVM tests are repeatable; using a "properly" seeded RNG would make it difficult to reproduce failures related to the RNG.

Now we can answer the original question, rephrased as "Why does HotSpot's implementation of Object.hashCode() use os::random()?"

The answer to this question is likely simply because it's easy, and it works. Hash codes need to be well-distributed, something an RNG provides. The simplest, most accessible RNG in this area of the JVM is os::random(). Since Object.hashCode() provides no guarantees about the source of these values, it doesn't matter that os::random() isn't really random at all.

You'll notice that this is only one possible hashing strategy, several others are defined (and chosen by the hashCode global), including one which they will "likely make ... the default in future releases."

Ultimately, this is just an implementation detail. There is simply no need to more aggressively randomize Object.hashCode(), and it's entirely possible other JVMs don't do it this way, or other operating systems behave differently. In fact, in Eclipse I see different hash codes when running your code repeatedly. Furthermore, the contract for Object.hashCode() suggests that typical JVM implementations don't implement Object.hashCode() this way at all:

This is typically implemented by converting the internal address of the object into an integer


Note also that your test only verifies that the first call to .hashCode() is consistent. In any sort of multi-threaded program you could not expect this behavior. It also relies on nothing else in the JVM calling os::random() during execution, which it could do at any time (for instance, if the garbage collector relies on os::random() the result of .hashCode() calls after the first GC will be non-deterministic).

dimo414
  • 47,227
  • 18
  • 148
  • 244
1

There's no advantage to having different values between executions. Indeed it increases the chance of difficult to reproduce bugs.

What is important is that the probability that two objects are assigned the same hash-code is low during a single execution.

If someone has found a seed that results in a run of values that takes a long time to produce a repeat (or a long run with very few repeats) then it makes sense to start with that every time.

I haven't checked the source of Hotspot to determine if there's been some effort to pick a 'good' seed. But the point is the aim here is to have a good spread. Randomness is not required and variation from execution to execution is at best useless at worst unhelpful.

Persixty
  • 8,165
  • 2
  • 13
  • 35