9

How do I implement a good hashcode if there are two boolean fields? Usually people just add the integer values to their hashcode values. But if I simply add 1 or 0 to my hashcode, I do not think it is good. Because if I have two objects of class A:

obj1.b = true, obj1.c = false.

obj2.b = false, obj2.c = true.

Everyting else is the same. Then the hashcode for these two unequal objects are the same. I know this situation is okay. But imagine if there are 100 boolean fields, then there would be too much collision right? I do not want so many different objects to fall in the same bucket.

What I did below is to assign different numbers to different truth values for each field so objects hashcodes can be very different.

public class A {

    private final String a;
    private final boolean b;
    private final boolean c;

...

@Override public int hashCode(){
            int i,j;
            if(b) {
                    i = 10;
            }
            else {
                    i = 0;
            }
            if(c) {
                    j = 88;
            }
            else {
                    j = 3;
            }
            int result = 0;
            result = 31*result + i + j;
            result = 31*result + (a != null ? a.hashCode() : 0);
            return result;
    }
}
loop
  • 219
  • 2
  • 5
  • 12
  • You could use [`Boolean.hashCode()`](http://stackoverflow.com/questions/3912303/boolean-hashcode), or you could "bit-shift" one of them to use different bits of the result, similar to [how bit flags work](http://eddmann.com/posts/using-bit-flags-and-enumsets-in-java/) – Jon Egeland Feb 24 '15 at 03:43
  • I cannot find your question. But your concern can be solved by a simple logic as described in http://stackoverflow.com/a/113600/395202. (which is similar to what you did, just simpler). – Adrian Shum Feb 24 '15 at 03:44
  • Another good explanation: http://stackoverflow.com/questions/3912303/boolean-hashcode – Marcelo Keiti Feb 24 '15 at 03:49
  • @MarceloKeiti That has the same answer as the one that Adrian linked – Jon Egeland Feb 24 '15 at 03:49

3 Answers3

9

You have a couple of options:

Option 1: Bit flagging

The best way to guarantee that there can never be collisions between boolean hashes is to use a technique similar to the one used in bit flagging, whereby you have each boolean occupy its own bit. For example:

// `byte` can be replaced with `short`, `int`, or `long` to fit all of your variables.
byte = 0;
if(bool1) booleans += 1;  // 0001
if(bool2) booleans += 2;  // 0010
if(bool3) booleans += 4;  // 0100
if(bool4) booleans += 8;  // 1000
...

However, this approach quickly becomes inefficient with a large number of booleans and is highly dependent on the size of the target array. For example, if you have a target array of size 16, only the first 4 have an effect on the hash value (since the maximum index is 1111).

The two solutions to this are to either increase the size of your target array (which might not be under your control), or ensure that the order of your booleans goes from most to least variadic. Neither of these are optimal, and so this method is quick and easy, but not very effective in practice.

Option 2: Base-changing hash

The design that Pham Trung shows in his answer expands on Option 1 as an easier way to accomodate multiple fields. As Adrian Shum commented, this answer provides an overview of a "general hashing algorithm" which is designed to be effective independent of what you are trying to hash.

The basic idea is to multiply a simplified hash value for each type by some arbitrarily large prime number to ensure that each hash is unique (though the proof for this eludes me). For example:

int result = 0;
result = 31*result + bool1 ? 1 : 0;
result = 31*result + bool2 ? 1 : 0;
...

For an even more sparse hash distribution, you can combine this with Boolean.hashCode, as the other answers show:

int result = 0;
result += 31*result + bool1.hashCode();
result += 31*result + bool2.hashCode();
...

What's great about this solution is that it can be applied to other types, like you already have in your sample code:

...
result = 31*result + i;
result = 31*result + (a != null ? a.hashCode() : 0);
result = 31*result + my_complex_object.hashCode();

Note: In these examples, 31 is just some arbitrary prime. You could just have easily used 37, 113, or 23456789. However, there are trade-offs for using larger multiplicands, namely that your hash will more quickly exceed Integer.MAX_VALUE and invalidate your hash.

Community
  • 1
  • 1
Jon Egeland
  • 12,470
  • 8
  • 47
  • 62
  • 1
    -1 for option 1, which is not helping in OP's concern. i.e. if there are two boolean values involved in hashcode calculation, true+false and false+true will give same hashcode. Option 2 is not optimal either, given will easily goes to same bucket (for example, for hash map with capacity of 16, hashcode with the "result bitmap" being 00100001 and 11110001 will fall into same bucket). In short, option1 and 2 are example of poorly designed hashing algorithm – Adrian Shum Feb 25 '15 at 01:18
  • @AdrianShum you're right. I had it in my mind that the values it used incremented every time it was called. I'll remove it. – Jon Egeland Feb 25 '15 at 01:20
  • "option 2" in my original comment becomes the new "option 1", sorry for editing the comment not quick enough ;) – Adrian Shum Feb 25 '15 at 01:27
  • @AdrianShum I see your point about the bitflagging option. However, you run into the same issue using the "base-changing" option as well. In that case with a hash of size 16, and a prime value of 31, you could end up with `(F,F,T)=00000001` and `(T,T,F)=11100001`, both of which would again fall into the same bucket. Simply put, you will always have collisions if your range of potential values exceeds the size of the target array. If you want, I can add a note about the dependence of the success of the hash on the size of the target. – Jon Egeland Feb 25 '15 at 03:40
  • I think you missed the point. Assuming you have 100 booleans in the hashcode calculation, with the "bitset" approach, only the least significant 5 booleans will determine which bucket it will go (in case of a hashmap of capacity of 16). That means, for some reason, if most data is having same value for that 5booleans, but very different for the other 95 booleans, they will still go to same bucket. That's why it is a poor hashing algorithm. However with the so-called base-changing approach, such effect is insignificant – Adrian Shum Feb 25 '15 at 03:51
  • @AdrianShum I've added some notes to Option 1. If you think it could be worded better or if I've missed anything, feel free to edit the answer with your input on the matter. However, Option 1 *is* more effective for small numbers of inputs, and so I feel it is important to leave it in the answer. – Jon Egeland Feb 25 '15 at 04:03
  • I do not think the update cover all the concern. In brief, I do not think option 1 should even be mentioned because 1. it is a poor hashing algorithm base on what I mentioned, 2. Although it run slightly faster, in real-life the difference is rarely significant. i.e. This option is only applicable to very rare and specialized situation, which is not worth mentioning (or it is in fact bad mentioning here because other people may mistakenly used this approach which caused problem) – Adrian Shum Feb 25 '15 at 04:17
5

When you have two or even more booleans, the hashcode algorithm is already taken care of that.

Look a little bit closer:

// Very simple example
public int hashCode() {
    int result = 31;

    for(boolean val : booleanList) {
        // This is the important part:
        result = 31 * result + Boolean.hashCode(val);
    }

    return result;
}

Notice the main part of the for loop, in this case, we treat each boolean differently, as we always multiply the result by 31 (or any good prime number) before adding it into the result.

If we visualize the whole hashcode as a number of base 31, so we can understand that the position and the value of each boolean are all taken into account in the final result. Each boolean can be treated as a digit in the hashcode, so for case (true, false) and case (false, true), they will have two different hashcodes.

squ1dd13
  • 84
  • 2
  • 9
Pham Trung
  • 11,204
  • 2
  • 24
  • 43
1

When combining multiple fields, a good practice is to start with the first field hashCode then multiply the current result by a prime number before adding each field hashcode:

@Override public int hashCode() {
  int result = 0;
  result = b1.hashCode();
  result = result * 37 + b2.hashCode();
  result = result * 37 + b3.hashCode();
  // ...
  // result = result * 37 + bn.hashCode();
  return result;
}

You can see a real world example in code generated by Wire (a Prococol Buffers implementation).

References:

Community
  • 1
  • 1
Filipe Borges
  • 2,712
  • 20
  • 32