45

The code of the equals method in class String is

public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        int n = count;
        if (n == anotherString.count) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = offset;
            int j = anotherString.offset;
            while (n-- != 0) {
                if (v1[i++] != v2[j++])
                    return false;
            }
            return true;
        }
    }
    return false;
}

I have a question - why does this method not use hashCode() ?

As far as I know, hashCode() can compare two strings rapidly.

UPDATE: I know, that two unequal strings, can have same hashes. But two equal strings have equal hashes. So, by using hashCode(), we can immediately see that two strings are unequal.

I'm simply thinking that using hashCode() can be a good filter in equals.

UPDATE 2: Here some code, about we are talking here.

It is an example how String method equals can look like

public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        if (hashCode() == anotherString.hashCode()){
            int n = count;
            if (n == anotherString.count) {
                char v1[] = value;
                char v2[] = anotherString.value;
                int i = offset;
                int j = anotherString.offset;
                while (n-- != 0) {
                    if (v1[i++] != v2[j++])
                        return false;
                }
                return true;
            }
        }else{
            return false;
        }
    }
    return false;
}
Mike Nakis
  • 56,297
  • 11
  • 110
  • 142
Mary Ryllo
  • 2,321
  • 7
  • 34
  • 53
  • 4
    Two completely different strings may have the same hashcode. A hashcode is represented by an int which has a max value of 2^31, does really *that much* different strings exist in the world so that they can be uniquely identified by a hashcode? – BalusC Jan 10 '13 at 16:21
  • 4
    It would make it fast if it has been set already (which it might not be) It's not hard to check so I wonder why too. – Peter Lawrey Jan 10 '13 at 16:22
  • 1
    @BalusC You would still have to do a full check for equal, or unset hashCodes. But if the hashCodes are set and different.... – Peter Lawrey Jan 10 '13 at 16:22
  • It would be less efficient in case of inequal strings and I bet that this occurs more often than comparing equal strings. – BalusC Jan 10 '13 at 16:25
  • 2
    If it had been implemented that way I'm pretty sure that the savings wouldn't have amounted to anything that anyone would have ever noticed and yet would require more testing in a clas that must be 100% rock-solid. You should never trade simplicity for performance until you are absolutely sure it will make a real difference. – Bill K Jan 11 '13 at 00:31
  • @BalusC: The unequal-string case would be expedited; it would be the equal-case that would be marginally slower (though since hash codes are cached, the difference should be slight). – supercat Jan 15 '13 at 23:30
  • I think this would only be useful if, in the typical case, comparing for equality was significantly more expensive than comparing the hash codes for equality. – Kevin Aug 12 '14 at 01:26

8 Answers8

40

Hashcode could be a first-round check for inequality. However, it presents some tradeoffs.

  1. String hashcodes are lazily calculated, although they do use a "guard" value. If you're comparing strings with long lifetimes (ie, they're likely to have had the hashcode computed), this isn't a problem. Otherwise, you're stuck with either computing the hashcode (potentially expensive) or ignoring the check when the hashcode hasn't been computed yet. If you have a lot of short-lived strings, you'll be ignoring the check more often than you'll be using it.
  2. In the real world, most strings differ in their first few characters, so you won't save much by checking hashcode first. There are, of course, exceptions (such as URLs), but again, in real world programming they occur infrequently.
parsifal
  • 1,645
  • 9
  • 6
  • 3
    "In the real world most strings differ in their first few characters, so you won't save much by checking hashcode first". That's just a weasel word and it could be confusing for beginners. http://en.wikipedia.org/wiki/Weasel_word – Sedat Kapanoglu Jan 10 '13 at 21:42
  • 6
    @ssg - no, it's a recognition that there are no absolutes, and that part of design is figuring out the common case and optimizing for it. And, of course, the corollary that you'll have to live with your decisions if you optimize for the uncommon case. Something that beginners should learn early. – parsifal Jan 10 '13 at 21:51
  • 2
    It just sounds like you present your personal opinion as a fact. Your assertion about what's common in the answer contradicts with your comment stating "one should figure out their own common case". – Sedat Kapanoglu Jan 10 '13 at 22:04
  • It is worth putting things in perspective: (i) it would only add 2 int assignements and at most 3 int comparisons - we are talking about less than 50 ms per million calls to equals (ii) but it is true that it only saves a loop when the strings have identical ending characters (the loop is backwards) (iii) and in the worst case, which is 2 equal strings, it does not help... – assylias Jan 10 '13 at 22:40
  • @ssg - s/opinion/experience/, in a wide range of applications. And as for "*one* should figure out *their own* common case," that's not something I said. The developers of a general-use API have to figure out the common case for general use. – parsifal Jan 11 '13 at 13:47
  • @ssg - and lest you take offense at me relying on my experience to say what's comment, I'll note that when it comes to `String`, I'm defering to the experience of Gosling, Bloch, Lea, et all. I'm simply applying my own experience to understanding the tradeoffs they made. – parsifal Jan 11 '13 at 14:08
  • 3
    @parsifal: Your statement in the answer doesn't sound like that you assumed facts based on implementation decisions. It rather sounds like you have a scientific, verified fact universally applicable to everything. That's the weasel part. I'm saying that it's misleading for beginners. It's dangerous to have such shortcut conclusions. I thought you were suggesting beginners to research but you said you didn't say that. So it's even a more dangerous statement now. – Sedat Kapanoglu Jan 11 '13 at 15:09
15

This question has actually been considered by the developers of the JDK. I could not find in the various messages why it has not been included. The enhancement is also listed in the bug database.

Namely, one of the proposed change is:

public boolean equals(Object anObject) {
    if (this == anObject) // 1st check identitiy
        return true;
    if (anObject instanceof String) { // 2nd check type
        String anotherString = (String)anObject;
        int n = count;
        if (n == anotherString.count) { // 3rd check lengths
            if (n != 0) { // 4th avoid loading registers from members if length == 0
                int h1 = hash, h2 = anotherString.hash;
                if (h1 != 0 && h2 != 0 && h1 != h2) // 5th check the hashes
                    return false;

There was also a discussion to use == for interned strings (i.e. if both strings are interned: if (this != anotherString) return false;).

assylias
  • 321,522
  • 82
  • 660
  • 783
7

1) Calculating hashCode may not be faster than comparing the Strings directly.

2) if the hashCode is equal, the Strings may still not be equal

MrSmith42
  • 9,961
  • 6
  • 38
  • 49
  • 1
    I know about #2. See my update. We calculate hashCode once, but we can call method equals a lot of time – Mary Ryllo Jan 10 '13 at 16:33
  • @assylias: Ok if hashCode was previous calculated, it may be a good ides to compare them first and only compare the Strings if the hashCode is equal. But of the hashCodes of both Strings were not calculated before, comparing the Strings directly is faster. – MrSmith42 Jan 10 '13 at 16:36
  • I am not sure why this answer is getting upvoted so much - there is a pending proposal to make the change detailed in the question - if it were so obviously counterproductive it would have been rejected... – assylias Jan 10 '13 at 17:05
  • 1
    @assylias: My answer does not mean that adding a hashCode compare would be a bad idea. It only lists what may have lead to the current design decision. – MrSmith42 Jan 10 '13 at 17:10
4

This can be a good idea for many use cases.

However, as a foundation class that is widely used in all kinds of applications, the author really has no idea whether this extra checking can save or hurt performance on average.

I'm gonna guess that the majority of String.equals() are invoked in a Hashmap, after the hash codes has been known to be equal, so testing hash codes again is pointless.

If we consider comparing 2 random strings, even with a small char set like US ASCII, it is very likely that the hashes are different, and char-by-char comparison fails on 1st char. So it'll be a waste to check hashes.

irreputable
  • 44,725
  • 9
  • 65
  • 93
2

AFAIK, The following check could be added to String. This check that if the hash codes are set and they are different, then the Strings cannot be equal.

if (hash != 0 && anotherString.hash != 0 && hash != anotherString.hash)
    return false;
if (hash32 != 0 && anotherString.hash32 != 0 && hash32 != anotherString.hash32)
    return false;
Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • I think if frofiling shows that this cases match often (so the extra code improves the speed of the method in average), it might be a good idea to add this to the current implementation. – MrSmith42 Jan 10 '13 at 17:06
  • Hmmm, submitted four years ago. :P – Peter Lawrey Jan 10 '13 at 17:07
  • 1
    @PeterLawrey Not sure why it was not closed (either positively or negatively) when they pushed the redesigned string class a few months ago... – assylias Jan 10 '13 at 17:09
0

The string hash code is not available for free and automatically. In order to rely on hash code, it must be computed for both strings and only then can be compared. As collisions are possible, the second char-by-char comparison is required if the hash codes are equal.

While String appears as immutable for the usual programmer, it does have the private field to store its hashcode once it is computed. However this field is only computed when hashcode is first required. As you can see from the String source code here:

 private int hash;

 public int hashCode() {
      int h = hash;
      if (h == 0) {
         ...
         hash = h;  
      }
      return h;
 }

Hence it is not obvious that it makes sense to compute the hashcode first. For your specific case (maybe same instances of really long strings are compared with each other a really many of times), it still may be: profile.

Audrius Meškauskas
  • 20,936
  • 12
  • 75
  • 93
  • 1
    There could be a first check that the hashcode is available for both strings, in which case compare them, otherwise move on to the normal algo. – assylias Jan 10 '13 at 16:34
  • 1
    The question is why String#equals does not use String#hashcode to speed up tests. – assylias Jan 10 '13 at 16:41
0

As I think, hashCode() can make comparison of two strings quicker.

Arguments?

Arguments against this proposal:

  • More Operations

hashcode() from String has to access every character in the String and has to do 2 calculations for every character.
So we need for a string with n characters 5*n operations (load, multiplication, lookup/load, multiplication, store). Two times, because we compare two Strings. (Ok, one store and one load does not really count in a reasonable implementation.)
For the best case, this makes a total of 10*x operations for two strings with length m and n and x=min(m,n). Worst case is 10*x with x=m=n. Average somewhere between, perhaps (m*n)/2.

The current equals implementation needs in the best case 3 operations. 2 loads, 1 compare. Worst is 3*x operations for two strings with length m and n and x=m=n. Average is somewhere between, perhaps 3*(m*n)/2.

  • Even if we cache the hash, it is not clear that we save something

We have to analyze usage patterns. It could be that most of the time, we will only ask one time for equals, not multiple times. Even if we ask multiple times, it could not be enough to have time savings from the caching.

Not direct against the speed, but still good counterarguments:

  • Counter intuitive

We do not expect a hashcode in equals, because we know for sure that hash(a)==hash(b) for some a!=b. Everyone reading this (and knowledge about hashing) will wonder what is happening there.

  • Leads to bad examples/unexpected behavior

I can already see the next question on SO: "I have a String with some billion times 'a'. Why does it take forever to compare it with equal() against 'b'?" :)

tb-
  • 1,240
  • 7
  • 10
  • 1
    the question is not about calculating the hashcode when equals is called, it is about checking if the hashcode has already been calculated, in which case use it, and if it hasn't move on to the usual method. The extra penalty in your latest example is going to be a couple of machine instructions `if (hash == 0 /*i.e. not calculated yet*/) goto usual method;`. – assylias Jan 10 '13 at 18:30
0

If the hash code takes the whole content of the string into account, then calculating the hash code of a string with n characters takes n operations. For long strings that's a lot. Comparing two strings takes n operations if they are the same, not longer than calculating the hash. But if the strings are different, then a difference is likely to be found a lot earlier.

String hash functions usually don't consider all characters for very long strings. In that case, if I compare two strings I could first compare the characters used by the hash function, and I'm at least as fast as checking the hashes. But if there is no difference in these characters, then the hash value will be the same, so I have to compare the full strings anyway.

Summary: A good string comparison is never slower but often a lot faster than comparing the hashes (and comparing strings when the hashes match).

gnasher729
  • 51,477
  • 5
  • 75
  • 98
  • The suggestion does not require calculating the hashcodes - it only compares them if they have been calculated (and cached) already. – assylias Oct 14 '15 at 18:32