3

Why not using hashCode() under the hood of equals() to pre-check for equality first?

Quick draft tests:

@Fork(value = 1)
@Warmup(time = 1)
@Measurement(time = 1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)

public class Main {

  @Param({
    "o", // differ size
    "oooooooooooooooooo1", // same size, differ last symbol
    "oooooooooooooooooo2" // same content
  })
  String string1;

  @Param({
    "oooooooooooooooooo2"
  })
  String string2;

  @Benchmark
  public void stringEquals(Blackhole bh) {
    bh.consume(string1.equals(string2));
  }

  @Benchmark
  public void myEquals(Blackhole bh) {
    bh.consume(myEquals(string1, string2));
  }

  boolean myEquals(String str1, String str2){
    if (str1.hashCode()==str2.hashCode()) {
      return str1.equals(str2);
    }
    return false;
  }
}

Results:

Benchmark                    (string1)            (string2)  Mode  Cnt   Score   Error  Units
Main.myEquals                        o  oooooooooooooooooo2  avgt    5   5.552 ± 0.094  ns/op
Main.myEquals      oooooooooooooooooo1  oooooooooooooooooo2  avgt    5   5.626 ± 0.173  ns/op
Main.myEquals      oooooooooooooooooo2  oooooooooooooooooo2  avgt    5  14.347 ± 0.234  ns/op
Main.stringEquals                    o  oooooooooooooooooo2  avgt    5   6.441 ± 1.076  ns/op
Main.stringEquals  oooooooooooooooooo1  oooooooooooooooooo2  avgt    5  13.596 ± 0.348  ns/op
Main.stringEquals  oooooooooooooooooo2  oooooooooooooooooo2  avgt    5  13.663 ± 0.126  ns/op

As you can see we got great speedup for the case of "same size, differ last symbol".

I think under the hood of String.equals() check for hashCode() equality should replace check for length() equality as it takes the same time:

  @Benchmark
  public void emptyTest(Blackhole bh) {
    bh.consume(0);
  }

  @Benchmark
  public void stringLength(Blackhole bh) {
    bh.consume(string2.length());
  }

  @Benchmark
  public void stringHashCode(Blackhole bh) {
    bh.consume(string2.hashCode());
  }

Benchmark                      (string2)  Mode  Cnt  Score   Error  Units
Main.emptyTest       oooooooooooooooooo2  avgt    5  3.702 ± 0.086  ns/op
Main.stringHashCode  oooooooooooooooooo2  avgt    5  4.832 ± 0.421  ns/op
Main.stringLength    oooooooooooooooooo2  avgt    5  5.175 ± 0.156  ns/op

PS I have a feeling my measurements method might be wrong, so any comments are welcome. Also, the hash is saved inside String and that also might produce some misleading results...

UPD1: As @AdamSiemion mentioned we need to recreate string every time a benchmarked method is called to avoid cashing of hash code:

  String str1, str2;

  @Setup(value = Level.Invocation)
  public void setup(){
    str1 = string1;
    str2 = string2;
  }

  @Benchmark
  public void stringEquals(Blackhole bh) {
    bh.consume(str1.equals(str2));
  }

  @Benchmark
  public void myEquals(Blackhole bh) {
    bh.consume(myEquals(str1, str2));
  }

Benchmark                    (string1)            (string2)  Mode  Cnt   Score   Error  Units
Main.myEquals                        o  oooooooooooooooooo2  avgt    5  29.417 ± 1.430  ns/op
Main.myEquals      oooooooooooooooooo1  oooooooooooooooooo2  avgt    5  29.635 ± 2.053  ns/op
Main.myEquals      oooooooooooooooooo2  oooooooooooooooooo2  avgt    5  37.628 ± 0.974  ns/op
Main.stringEquals                    o  oooooooooooooooooo2  avgt    5  29.905 ± 2.530  ns/op
Main.stringEquals  oooooooooooooooooo1  oooooooooooooooooo2  avgt    5  38.090 ± 2.933  ns/op
Main.stringEquals  oooooooooooooooooo2  oooooooooooooooooo2  avgt    5  36.966 ± 1.642  ns/op

So, we still have almost 30% speedup for the case of "same size, differ last symbol".

UPD2 As @DanielPryden mentioned str1 = string1 will not create new String. So we need to explicitly do so:

  @Setup(value = Level.Invocation)
  public void setup(){
    str1 = new String(string1);
    str2 = new String(string2);
  }

Benchmark                    (string1)            (string2)  Mode  Cnt   Score   Error  Units
Main.myEquals                        o  oooooooooooooooooo2  avgt    5  61.662 ± 3.068  ns/op
Main.myEquals      oooooooooooooooooo1  oooooooooooooooooo2  avgt    5  85.761 ± 7.766  ns/op
Main.myEquals      oooooooooooooooooo2  oooooooooooooooooo2  avgt    5  92.156 ± 8.851  ns/op
Main.stringEquals                    o  oooooooooooooooooo2  avgt    5  30.789 ± 0.731  ns/op
Main.stringEquals  oooooooooooooooooo1  oooooooooooooooooo2  avgt    5  38.602 ± 1.212  ns/op
Main.stringEquals  oooooooooooooooooo2  oooooooooooooooooo2  avgt    5  38.921 ± 1.816  ns/op

So, now we have what was expected: using hashCode() will always be slower then equals(). And that has a total sense (as @Carcigenicate mentioned in comments below): hashCode() need to do full traversal through char[] to produce the hash. I thought it might be some intrinsic under the hood of hashCode() that make it faster, but it has not.

Therefore, it's still possible to get some speed up of equals() if make a check for precalculated hash existence and compare them:

public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        int n = value.length;
        if (n == anotherString.value.length
           // new code begins
            && (hash==0 || anotherString.hash==0 || hash==anotherString.hash)) {
           // new code ends
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = 0;
            while (n-- != 0) {
                if (v1[i] != v2[i])
                    return false;
                i++;
            }
            return true;
        }
    }
    return false;
}

We'll get some small(?) slowdown in case of equal strings (for checking hash fields), but also will get a speedup in case of strings with the same length but different content and already precalculated hashes.

Unfortunately, I can't test it as I can't change the source code of String class.

  • 3
    I know nothing about the benchmarking tool being used here, but I would think that it shouldn't make a difference, or potentially even be more expensive to pre-check hashes. The String hash function, afaik, requires a full iteration of the string, then the `equals` check would likely require a *second* iteration. You'd be doing a full iteration just to see if you need to do another iteration. Unless the hash has been previously calculated already. – Carcigenicate Jan 06 '19 at 22:06
  • 2
    Because identical hashcodes would only establish they might be equal. `System.out.println("FB".hashCode() == "Ea".hashCode());`, and then you would have to test further to determine if they actually are. – Elliott Frisch Jan 06 '19 at 22:07
  • A much fairer test would be to generate a large number of random pairs of strings with just the last character changed. I suspect you'd get a different result in that case. – sprinter Jan 06 '19 at 22:11
  • It is just a choice, most probably Oracle/Sun believed such kind of benefit is not actually useful in real-life. Reminded me a question in the past, asking why the hashCode of String is calculated as of now, and it is going to give a lot of collision in some special cases etc. https://stackoverflow.com/a/9407032/395202 . What make you think optimizing for "same length but different last character" and sacrificing other natural text is a good design choice? – Adrian Shum Jan 07 '19 at 02:54
  • I think any chance to avoid heavy per-symbol comparing for **no extra cost in some cases** will benefit String class. – Artsiom Chapialiou Jan 07 '19 at 03:04
  • 1
    @ArtsiomChapialiou: Is your `setup()` function intending to create new string objects? Because it most definitely is not doing that. You will need to use the `new String` constructor to get a new string object. I agree that your benchmark is flawed here: what you are showing is that you get a *small* speedup in the case where the string's hashcode is already cached *and* the strings are almost, but not quite, identical. The most common cases where `equals()` is called, the hash code is not yet precomputed and the strings are either identical or else have more than one character differing. – Daniel Pryden Jan 07 '19 at 11:11
  • @DanielPryden `new String()` is redundant. Strings are immutable, i.e. it is copied not by reference but by creating a new string. So `str1 = string1` will create new String with identical content of string1. Also getting 30% speedup in some cases for **no extra cost** in other cases, will definitely benefit String class. – Artsiom Chapialiou Jan 07 '19 at 18:00
  • 2
    @ArtsiomChapialiou: That is incorrect. The semantics of reference types in Java do not behave differently for mutable or immutable objects. An assignment operation using the `=` operator *never* creates a new object (with the exception of a boxing conversion, which doesn't apply here). `str1 = string1` means that you now have two *variables* which both refer to the *same* String object. All types that are subtypes of `java.lang.Object` are reference types. A variable of type `String` is *not* a string, but a reference *to* a string. – Daniel Pryden Jan 07 '19 at 18:08
  • 1
    @ArtsiomChapialiou: More importantly, your own benchmarks refute your assertion of *no extra cost*: in the case where the strings are equal, your implementation is 5% slower. If in a given application it would be expected that strings could be compared as equal 20 times for every time that they were unequal, then your implementation would lose regardless of how much faster it made the comparison. And I expect the margin would be substantially more than 5% in a large number of real-world situations involving very large strings (thousands or millions of characters). – Daniel Pryden Jan 07 '19 at 18:25
  • @DanielPryden copying the String will always create new String. That easy to check: `String str1 = "a"; String str2 = str1; str2 = str2 + "b";` System.out.println(str1); After executing that code you'll se that str1 still "a". Also redundancy of `new String()` suggested by IDEA inspections: if you try to write `str2 = new String(str1);` "...new String(anotherString) (equivalent to anotherString)..." – Artsiom Chapialiou Jan 08 '19 at 03:38
  • As for 5% slowdown for equivalent Strings, we'll have 30% speedup for other cases, that's fair tradeoff... Also I can't change the original String class implementation. So, I can do only external hashCode checks in my tests which is definitely less effitient then doing so from inside the String class. – Artsiom Chapialiou Jan 08 '19 at 03:45
  • And by the way, 5% is inside measurement error: 37.628 ± 0.974 is equivalent to 36.966 ± 1.642 – Artsiom Chapialiou Jan 08 '19 at 03:54
  • A better test for new String creation not referencing will be: `String str1 = "a"; String str2 = str1; str1 = "b"; System.out.println(str2);` Print `a` – Artsiom Chapialiou Jan 08 '19 at 04:25
  • 2
    @ArtsiomChapialiou: I'm not going to argue with you here. I suggest you open a new question about string copying. You have a fundamental misunderstanding: you are confusing *assignment* and *mutation*. I would be happy to write an answer explaining what's actually going on, but this comment space is too small for that. – Daniel Pryden Jan 08 '19 at 12:21
  • @DanielPryden You're right and I've been wrong. Thank you for your time to educate me! See UPD2. – Artsiom Chapialiou Jan 08 '19 at 19:59
  • 2
    To further complicate the situation the jvm(s) usually have intrinsics for the String equals code. Meaning the jvm basically "swaps" in a platform optimized version instead of the code you see there. It started to use SSE 4.2 instructions if available on the cpu for example for this back in 2009 or so. See https://bugs.openjdk.java.net/browse/JDK-6761600 – Mattias Isegran Bergander Jan 08 '19 at 21:17

2 Answers2

1

Your performance tests calling hashCode() thousands of times (using jmh) do not make sense because String hash code is cached:

/** Cache the hash code for the string */
private int hash; // Default to 0

public int hashCode() {
  int h = hash;
  if (h == 0 && value.length > 0) {
    char val[] = value;

    for (int i = 0; i < value.length; i++) {
      h = 31 * h + val[i];
    }
    hash = h;
  }
  return h;
}

So once String hash code is computed, calling hashCode() has almost no cost - contrary to the majority of the Java classes which recompute the hash code every time hashCode() is called.

Usually it is equals() which is faster than hashCode() as it is usually uses a short-circuit evaluation. For example, if you have a call with 10 fields, and the values in first fields of the two provided instances differ equals() will not inspect the remaining 9 fields, while hashCode() (usually) be computed from all the 10 fields.

Adam Siemion
  • 15,569
  • 7
  • 58
  • 92
  • Can you elaborate or provide a link for further study regarding the line: `as it is usually uses a short-circuit evaluation.` please? – Paul Nikonowicz Jan 06 '19 at 22:33
  • 2
    @PaulNikonowicz I believe he means that an entire iteration of the fields (or characters in a String's case) isn't done unless necessary; that it uses an early exit. [Here](https://stackoverflow.com/a/15585595/3000206), see the line `if (v1[i++] != v2[j++]) return false;`. – Carcigenicate Jan 06 '19 at 22:56
  • 1
    @PaulNikonowicz let's use the `String` class as an example - if the characters at the first position are different then there is no point in looping through all the remaining characters - `equals()` returns `false` - the Strings are not equal. While `hashCode()` to compute the hash code needs to loop through all the characters. – Adam Siemion Jan 06 '19 at 23:21
  • @AdamSiemion Thanks for pointing on cashed hashes. I thought jmh by default create new params for each Benchmark method invocation. Looks like `Level.Invocation)` might help to do that, just a bit scared of it Javadoc... String str1, str2; @Benchmark public void stringEquals(Blackhole bh) { bh.consume(str1.equals(str2)); } @Benchmark public void myEquals(Blackhole bh) { bh.consume(myEquals(str1, str2)); } @Setup(value = Level.Invocation) public void setup(){ str1 = string1; str2 = string2; } – Artsiom Chapialiou Jan 07 '19 at 02:28
  • Uphph... What an inconvinient comments system SO has... See update in question. – Artsiom Chapialiou Jan 07 '19 at 02:36
  • @AdamSiemion Shouldn't it be a common case for the String to be created once and then compared many times? In that case, using the hash code (if for any reason it was computed before) for an additional check before comparing per symbol would save time. Addition for String.equal() after length check: `if (hash !=0 && hash == anotherString.hash) //then proceed to per-symbol compare` – Artsiom Chapialiou Jan 07 '19 at 02:51
  • 1
    Correct check will be: `if (hash!=0 && anotherString.hash!=0 && hash==anotherString.hash)` – Artsiom Chapialiou Jan 07 '19 at 03:00
  • `(hash==0 || anotherString.hash==0 || hash==anotherString.hash)` would be proper condition. Previous one loose the cases when any of hashes are 0's but we need to proceed father char[] traversal. – Artsiom Chapialiou Jan 08 '19 at 20:05
1

I agree that comparing the hashCode (only if already calculated) looks like it might boost up performance, as String objects are immutable.

Considerations:

  1. The boost is only for when the hashCode already exists (was already calculated). If the hashCode wasn't already calculated, calculating might take longer than comparing the chars of 2 strings. This because, when comparing, we can stop as soon as we see a difference. For example, when comparing "aaxxxxxxxxxx" to "aazzzzzzzzzzz", we can stop after the second char if the strings aren't equal. But calculating hashCode will need a traversal over all chars.

  2. Perhaps the writers' decision was based on stats regarding how Strings are used. They might have seen that the additional comparison of the hashCode might slow the system.

    For example, if most strings are used withing hash maps/tables, than the hashCode is already compared and used. All the strings left to compare have the same hashCode, so there's no need to compare the hashCodes again.

  3. The hash field might be calculated in several threads simultaneously for the same object, especially if equals() uses it. This needs to be taken into consideration.

  4. Another consideration is the memory usage. Perhaps there's an optimization in the JVM to not use memory if an int field is zero? Once it's not zero, might it raise the memory consumption?

It would have been nice to have a way to tweak and measure this (String is final). Perhaps using some bytecode manipulation or using a different classloader...

Here's the code tweak for what was suggested (OpenJDK and Oracle look the same):

public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }

    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        int n = value.length;
        if (n == anotherString.value.length) {

            // THE HASHCODE TWEAK
            if (hash != 0 &&
                anotherString.hash != 0 &&
                hash == anotherString.hash)
            {
                return true;
            }

            char v1[] = value;
            char v2[] = anotherString.value;
            int i = 0;
            while (n-- != 0) {
                if (v1[i] != v2[i])
                    return false;
                i++;
            }
            return true;
        }
    }
    return false;
}
Mark Rotteveel
  • 100,966
  • 191
  • 140
  • 197
AlikElzin-kilaka
  • 34,335
  • 35
  • 194
  • 277