20

We are writing a class which requires very complicated logic to compute equals() and hashCode(). Something along lines with:

@Getters @Setters @FieldDefaults(level=AccessLevel.PRIVATE)
public class ExternalData {
  TypeEnum type;
  String data;
  List<ExternalData> children;
} 

We do not construct these objects, they are deserialized from XML from external complex system. There are 20+ types and depending on type data can be ignored, or processed with children, or processed without children and comparison of data for each type of node depend on type.

We created equals() and hashCode() to reflect all those rules, but recently ran into an issue that hashCode got out of sync with equals causing equal objects to be added twice to a HashSet. I believe HashMap (and HashSet for that matter) are implemented this way in Java: https://en.wikipedia.org/wiki/Hash_table The implementation first put objects in buckets based on hashCode and then for each bucket checks equals. In unfortunate scenario where 2 equal object will go to different buckets they will never be compared by equals(). By "Out of sync" here I mean that they go into different buckets.

What is the best way to make sure equals and hashCode do not get out of sync?

Edit: This question is different from What issues should be considered when overriding equals and hashCode in Java? There they ask about generic guidance and the accepted answer does not apply in my situation. They say "make equals and hashCode consistent", here I'm asking how exactly do I do that.

Community
  • 1
  • 1
Artem
  • 7,275
  • 15
  • 57
  • 97
  • 4
    Property-based testing, probably. There is no simple rule, you have to test. – Paul Hicks Jul 03 '16 at 21:30
  • 2
    Basically code review. You could probably do a static analyais in some cases, to check that equals and hashCode use the same fields, but the corner cases will probably overwhelm such a system. You could also use a project like [Lombok](https://projectlombok.org/) to generate them for you. – yshavit Jul 03 '16 at 21:30
  • 1
    Wherever possible, use auto-generation/etc. to help you. For example, check out the [Immutables](http://immutables.github.io/) library - generates hashCode/equals for you. – Oliver Charlesworth Jul 03 '16 at 21:33
  • Have `equals` call `hashCode`? – Robert Jul 03 '16 at 22:00
  • 4
    @Robert That's a horrible idea. [The official documentation of `hashCode`](https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#hashCode--) indicates that **unequal** objects may have the same `hashCode`. – The SE I loved is dead Jul 04 '16 at 01:43
  • @dorukayhan Where do you get that from? In fact, it says "If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result." – Robert Jul 04 '16 at 01:56
  • 8
    @Robert Yes, equal objects **must** have the same hash code, but *unequal* objects *may* also have the **same** hash code, which is what dorukayhan said. Consider `Long`. There are 2^64 possible `Long` objects, and only 2^32 possible hash codes. By the pigeonhole principle, many unequal `Long` objects must have identical hash codes. – David Conrad Jul 04 '16 at 03:59
  • I'm no Java expert, but wouldn't the `Set` class check that two objects are equal (via `equals()`) if they have the same hash code? If that's true, then your `hashCode()` and `equals()` methods could be out of sync, but it wouldn't cause duplicates to appear in a set - only a bug in the `equals()` method would do that. – reduckted Jul 04 '16 at 04:21
  • @thirdwaffle The OP is saying that two objects that should be considered equal are getting two **different** hashcodes, hence these elements are being added twice. – Bakuriu Jul 04 '16 at 07:38
  • The solution is not to have complicated logic for equals. – Raedwald Jul 04 '16 at 07:55
  • 1
    @thirdwaffle The key point here is (a bit simplified) : The `Set` does **not** necessarily check for `equals` when the objects already have different `hashCode`. Also, it is not entirely clear what "out of sync" means in particular. In fact, one *could* implement `int hashCode() { return 0; }` for *all* classes. This would be *guaranteed* to be correct (as the `Set` then only relies on `equals`). But it would potentially be horribly inefficient, so certainly not a feasible approach in general. – Marco13 Jul 04 '16 at 08:57
  • 1
    I believe HashMap (and HashSet for that matter) are implemented this way in Java: https://en.wikipedia.org/wiki/Hash_table The implementation first put objects in buckets based on hashCode and then for each bucket checks equals. In unfortunate scenario where 2 equal object will go to different buckets they will never be compared by equals() – Artem Jul 04 '16 at 18:51
  • *They say "make equals and hashCode consistent", here I'm asking how exactly do I do that.* you write custom code that enforces the consistency there is no magic to it. Just wanting a different answer does not make it not a duplicate, it is a duplicate and the answer directly answers your question. Make them consistent, right now they are not. –  Jul 04 '16 at 18:55
  • @JarrodRoberson I think I have a more specific case compare to generic question mentioned above. And I'm asking to provide specifics on how do I solve the issue in my case. – Artem Jul 04 '16 at 18:59
  • then you need to ask a new more specific question ... but it will probably be answerable by the exact same generalized answer, which is *you need to write custom code to make it consistent* which as always been the correct answer. All the existing answers, confirm this as they all say the exact same thing in different generalized ways, thus making this **off-topic: recommendations** and **off-topic: opinion based** as well as a duplicate. –  Jul 04 '16 at 19:03
  • Just adding an edit that says *this is not a duplicate because I don't like the answers and want a spoon fed recommendation* does not make it not a duplicate –  Jul 04 '16 at 19:07
  • @JarrodRoberson: I'm not convinced that it's a duplicate. The linked question is asking *what things do I need to do*, while this one is asking *knowing what things I need to do, how do I ensure I have done them correctly*. I don't think this question actually needs to be closed. – Daniel Pryden Jul 04 '16 at 23:15
  • how to ensure they are consistent is easy, you write a unit test that proves that `a.equals(b) && a.hashCode() == b.hashCode()` that is what consistent means, there is nothing complicated about this and rewording the question sligthly does not change the correct answer. *Make them consistent*, how you do that is a case by case basis that is off-topic: too broad, recommendations as well as opniion based. –  Jul 05 '16 at 02:58

4 Answers4

6

The Guava testlib library has a class called EqualsTester that can be used to write tests for your equals() and hashCode() implementations.

Adding tests both helps you ensure that the code is correct now, and also ensures that it stays correct if/when you modify it in the future.

Daniel Pryden
  • 59,486
  • 16
  • 97
  • 135
  • @T.J.Crowder I liked your answer yesterday, what was the comment? :) – Artem Jul 04 '16 at 18:46
  • @T.J.Crowder I think it is actually a good idea and probably the only 100% way of guarantee that they are precisely the same not relying on people to write more tests and add in 2 places if they add functionality. Our system is offline (in a sense), there are no hard requirements on memory limits but there are on consistency. We actually do plan to have common code return a pair of hashCode and a boolean (if second argument was provided). So I'd be happy to accept your original answer :) – Artem Jul 05 '16 at 07:38
  • @Artem: I wasn't happy with the answer for the reasons Daniel pointed out, but I've updated it to something I'm happy with now and undeleted it. (I've also removed my comments above, as they really have nothing to do with Daniel's answer here.) – T.J. Crowder Jul 05 '16 at 08:47
5

If the traversal algorithm is sufficiently complex that you want to avoid repeating yourself, isolate the algorithm into a method that both equals and hashCode can use.

I see two options, which (as is so often the case) trade-off between being broadly-applicable and being efficient.

Broadly-applicable

The first option is to write quite a general traversal method that accepts a functional interface and calls back to it at each stage of the traversal, so you can pass a lambda or instance into it containing the actual logic you want to perform while traversing; the Visitor pattern. That interface would want to have a way to say "stop traversing" (e.g., so equals can bail when it knows the answer is "not equal"). Conceptually, that would look something like:

private boolean traverse(Visitor visitor) {
    while (/*still traversing*/) {
        if (!visitor.visitNode(thisNode)) {
            return false;
        }
        /*determine next node to visit and whether done*/
    }
    return true;
}

Then equals and hashCode use that to implement equality checking or hash code building without having to know the traversal algorithm.

I've chosen above to have the method return a flag for whether traversal ended early, but that's a design detail. You might not return anything, or might return this for chaining, whatever's suitable in your situation.

The issue, though, is that using it means allocating an instance (or using a lambda but then you probably need to allocate something for the lamba to update anyway to keep track of what it's doing) and doing a lot of method calls. Maybe that's fine in your case; maybe it's a performance killer because your app needs to use equals a lot. :-)

Specific and efficient

...and so you might want to write something specific to this case, writing something that has the logic for equals and hashCode built into it. It would return the hash code when being used by hashCode, or a flag value for equals (0 = not equal, !0 = equal). No longer generally-useful, but it avoids creating a visitor instance to pass in / lambda overhead / call overhead. Conceptually, this might look something like:

private int equalsHashCodeWorker(Object other, boolean forEquals) {
    int code = 0;

    if (forEquals && other == null) {
        // not equal
    } else {
        while (/*still traversing*/) {
            /*update `code` depending on the results for this node*/
        }
    }

    return code;    
}

Again the specifics will be, um, specific to your case as well as your style guide and such. Some folks would make the other argument serve two purposes (both flag and the "other" object) by having equals handle the other == null case itself and only call this worker when it has a non-null object. I prefer to avoid doubling up on the meaning of arguments like that, but you see it often enough.

Testing

Whichever way you go, if you're in a shop with a testing culture, naturally you'll want to build tests for the complex cases you've already seen fail as well as other cases where you see failure opportunities.

Side note about hashCode

Regardless of the above, if you expect hashCode to be called a lot, you might consider caching the result to an instance field. If the object you're doing this with is mutable (and it sounds like it is), you'd invalidate that stored hashcode any time you mutate the object's state. That way, if the object hasn't changed, you don't have to repeat the traversal on subsequent calls to hashCode. But of course, if you forget to invalidate the hash code in even one of your mutator methods...

T.J. Crowder
  • 1,031,962
  • 187
  • 1,923
  • 1,875
  • 4
    That doesn't really make sense to me: `equals()` requires a second object to compare to, but `hashCode()` doesn't. So I don't see any easy way to use the same code path for both. And I definitely would not encourage doing any allocations in `equals()` or especially `hashCode()`, since those methods will be called from code you don't control, often on a hot path. For example, adding an unrelated object to a `HashMap` that contains your object may cause your object's `hashCode()` to be invoked -- if that happens many thousands of times per second then you'll create a lot of GC pressure. – Daniel Pryden Jul 03 '16 at 22:56
  • 1
    @DanielPryden HashMap doesn't cache the hash code? Also, cached or not, it'll almost certainly only look at the hash of objects already in the table if it is resized, which will not happen on every insertion. – Random832 Jul 04 '16 at 03:29
  • @DanielPryden: Both very good points. I've updated the answer. – T.J. Crowder Jul 05 '16 at 08:46
5

one option to condsider may be code generation. Basically you write out a list of things that need to be compared and have a program that generates both an equals method and a hashcode method. Since both methods are generated from the same list of things to compare they should not fall out of sync (provided the individual elements don't of course).

plugwash
  • 9,724
  • 2
  • 38
  • 51
0

if a.equals(b), this implies a.hashcode() == b.hashcode().

However, be careful. !a.equals(b) does NOT imply a.hashcode() != b.hashcode().

This is simply due to the fact that hash collisions can be a serious issue depending on your algorithm and a large number of factors. In general, if two objects are equal their hashcode will always be equal. However, you cannot determine if two objects are equal only by comparing hashcode, as a.hashode() == b.hashcode() also does not imply a.equals(b).

Will Sherwood
  • 1,484
  • 3
  • 14
  • 27