2

Background information:

In my project I'm applying Reinforcement Learning (RL) to the Mario domain. For my state representation I chose to use a hashtable with custom objects as keys. My custom objects are immutable and have overwritten the .equals() and the .hashcode() (which were generated by the IntelliJ IDE).

This is the resulting .hashcode(), I've added the possible values in comments as extra information:

@Override
public int hashCode() {
    int result = (stuck ? 1 : 0);                // 2 possible values: 0, 1
    result = 31 * result + (facing ? 1 : 0);     // 2 possible values: 0, 1 
    result = 31 * result + marioMode;            // 3 possible values: 0, 1, 2
    result = 31 * result + (onGround ? 1 : 0);   // 2 possible values: 0, 1 
    result = 31 * result + (canJump ? 1 : 0);    // 2 possible values: 0, 1 
    result = 31 * result + (wallNear ? 1 : 0);   // 2 possible values: 0, 1 
    result = 31 * result + nearestEnemyX;        // 33 possible values: - 16 to 16
    result = 31 * result + nearestEnemyY;        // 33 possible values: - 16 to 16

    return result;
}

The Problem:

The problem here is that the result in the above code can exceed Integer.MAX_VALUE. I've read online this doesn't have to be a problem, but in my case it is. This is partly due to algorithm used which is Q-Learning (an RL method) and depends on the correct Q-values stored inside the hashtable. Basically I cannot have conflicts when retrieving values. When running my experiments I see that the results are not good at all and I'm 95% certain the problem lies with the retrieval of the Q-values from the hashtable. (If needed I can expand on why I'm certain about this, but this requires some extra information on the project which isn't relevant for the question.)

The Question:

Is there a way to avoid the integer overflow, maybe I'm overlooking something here? Or is there another way (perhaps another datastructure) to get reasonably fast the values given my custom-key?

Remark:

After reading some comments I do realise that my choice for using a HashTable wasn't maybe the best one as I want unique keys that do not cause collisions. If I still want to use the HashTable I will probably need a proper encoding.

Floris Devriendt
  • 2,044
  • 4
  • 24
  • 34
  • 1
    If you have such a special use of hashes, why do you use Object's .hashCode() to begin with? Why not use a dedicated hash? You can for instance use Guava's `HashFunction`s – fge May 13 '14 at 14:23
  • @fge That's `Object`'s `hashCode()`? I thought the normal implementation of `hashCode()` was a native method? – awksp May 13 '14 at 14:24
  • @user3580294 if this class is immutable, then the code above will generate a non-changing hash code. Anyway, that's not the main point in the question. – Luiggi Mendoza May 13 '14 at 14:25
  • @LuiggiMendoza Yep, just realized... – awksp May 13 '14 at 14:25
  • @user3580294 I meant "use" in the sense that the OP overrides it for its own needs, which do not seem to be the ones of `Object` – fge May 13 '14 at 14:25
  • @fge I'm not sure what you mean by that. Do you mind explaining? – awksp May 13 '14 at 14:27
  • 1
    @user3580294 it looks to me like the OP wants to generate unique keys for each object or something; the problem, of course, is that hashCode() makes no guarantees about that – fge May 13 '14 at 14:28
  • As to the data structure, have you considered a radix tree? – fge May 13 '14 at 14:29
  • Floris: By "retrieving values" do you mean getting values from the `HashMap`? Because if so the precise values `hashCode()` outputs shouldn't matter as it would only be used by the map and if your objects are immutable I don't see where a problem would be... – awksp May 13 '14 at 14:29
  • @fge He said he was using a hash table though, so I don't see how non-unique keys would be problematic... In any case, he seems to be more concerned about the range of values that `hashCode()` could output more than uniqueness – awksp May 13 '14 at 14:30
  • @user3580294: Yes, with retrieving the values I mean getting values from the ``HashMap``. The problem is I need a 1-to-1 mapping from key to value with no collisions, but the values need to be retrieved very fast. I'm aware that a ``HashTable`` isn't probably the right choice for datastructure, but I believe running over all my keys to find the correct one is too much time consuming. So yeah, you are correct in me wanting to have unique keys. – Floris Devriendt May 13 '14 at 14:35
  • 1
    Floris: `HashMaps` are already pretty darn fast, but the Java implementation does have some behind-the-scenes stuff like extra hashing that might slow things down. Also, it probably would have helped to say you needed perfect hashes instead of worrying about the range of numbers returned by `hashCode()`. If you're using the objects as keys into a `HashMap` the range of values returned by `hashCode()` doesn't matter so long as the method obeys the contract for `hashCode()` – awksp May 13 '14 at 14:37

4 Answers4

8

You need a dedicated Key Field to guarantee uniqueness

.hashCode() isn't designed for what you are using it for

.hashCode() is designed to give good general results in bucketing algorithms, which can tolerate minor collisions. It is not designed to provide a unique key. The default algorithm is a trade off of time and space and minor collisions, it isn't supposed to guarantee uniqueness.

Perfect Hash

What you need to implement is a perfect hash or some other unique key based on the contents of the object. This is possible within the boundries of an int but I wouldn't use .hashCode() for this representation. I would use an explicit key field on the object.

Unique Hashing

One way to use use SHA1 hashing that is built into the standard library which has an extremely low chance of collisions for small data sets. You don't have a huge combinational explosion in the values you posts to SHA1 will work.

You should be able to calculate a way to generate a minimal perfect hash with the limited values that you are showing in your question.

A minimal perfect hash function is a perfect hash function that maps n keys to n consecutive integers—usually [0..n−1] or [1..n]. A more formal way of expressing this is: Let j and k be elements of some finite set K. F is a minimal perfect hash function iff F(j) =F(k) implies j=k (injectivity) and there exists an integer a such that the range of F is a..a+|K|−1. It has been proved that a general purpose minimal perfect hash scheme requires at least 1.44 bits/key.2 The best currently known minimal perfect hashing schemes use around 2.6 bits/key.[3]

A minimal perfect hash function F is order preserving if keys are given in some order a1, a2, ..., an and for any keys aj and ak, j

A minimal perfect hash function F is monotone if it preserves the lexicographical order of the keys. In this case, the function value is just the position of each key in the sorted ordering of all of the keys. If the keys to be hashed are themselves stored in a sorted array, it is possible to store a small number of additional bits per key in a data structure that can be used to compute hash values quickly.[6]

Solution

Note where it talks about a URL it can be any byte[] representation of any String that you calculate from your object.

I usually override the toString() method to make it generate something unique, and then feed that into the UUID.nameUUIDFromBytes() method.

Type 3 UUID can be just as useful as well UUID.nameUUIDFromBytes()

Version 3 UUIDs use a scheme deriving a UUID via MD5 from a URL, a fully qualified domain name, an object identifier, a distinguished name (DN as used in Lightweight Directory Access Protocol), or on names in unspecified namespaces. Version 3 UUIDs have the form xxxxxxxx-xxxx-3xxx-yxxx-xxxxxxxxxxxx where x is any hexadecimal digit and y is one of 8, 9, A, or B.

To determine the version 3 UUID of a given name, the UUID of the namespace (e.g., 6ba7b810-9dad-11d1-80b4-00c04fd430c8 for a domain) is transformed to a string of bytes corresponding to its hexadecimal digits, concatenated with the input name, hashed with MD5 yielding 128 bits. Six bits are replaced by fixed values, four of these bits indicate the version, 0011 for version 3. Finally, the fixed hash is transformed back into the hexadecimal form with hyphens separating the parts relevant in other UUID versions.

My preferred solution is Type 5 UUID ( SHA version of Type 3)

Version 5 UUIDs use a scheme with SHA-1 hashing; otherwise it is the same idea as in version 3. RFC 4122 states that version 5 is preferred over version 3 name based UUIDs, as MD5's security has been compromised. Note that the 160 bit SHA-1 hash is truncated to 128 bits to make the length work out. An erratum addresses the example in appendix B of RFC 4122.

Key objects should be immutable

That way you can calculate toString(), .hashCode() and generate a unique primary key inside the Constructor and set them once and not calculate them over and over.

Here is a straw man example of an idiomatic immutable object and calculating a unique key based on the contents of the object.

package com.stackoverflow;

import javax.annotation.Nonnull;
import java.util.Date;
import java.util.UUID;

public class Q23633894
{

    public static class Person
    {
        private final String firstName;
        private final String lastName;
        private final Date birthday;
        private final UUID key;
        private final String strRep;

        public Person(@Nonnull final String firstName, @Nonnull final String lastName, @Nonnull final Date birthday)
        {
            this.firstName = firstName;
            this.lastName = lastName;
            this.birthday = birthday;
            this.strRep = String.format("%s%s%d", firstName, lastName, birthday.getTime());
            this.key = UUID.nameUUIDFromBytes(this.strRep.getBytes());
        }

        @Nonnull
        public UUID getKey()
        {
            return this.key;
        }

        // Other getter/setters omitted for brevity

        @Override
        @Nonnull
        public String toString()
        {
            return this.strRep;
        }

        @Override
        public boolean equals(final Object o)
        {
            if (this == o) { return true; }
            if (o == null || getClass() != o.getClass()) { return false; }
            final Person person = (Person) o;
            return key.equals(person.key);
        }

        @Override
        public int hashCode()
        {
            return key.hashCode();
        }
    }
}
  • There's no such a thing like "perfect hash", because a collision may occur anyway. By the way, you'll really need to hack the things to get the same hashsum – Dmitry Ginzburg May 13 '14 at 14:30
  • you can create a perfect hash for anything given enough bits to cover the set of possible values –  May 13 '14 at 14:31
  • @JarrodRoberson Are you even sure OP's looking for a perfect hash? He said he's using a hash table, which shouldn't care about the hash values that much unless he's rolling his own – awksp May 13 '14 at 14:33
  • @JarodRoberson that's truth in the mathematical sense, but `hashCode()` method returns an integer, which is bounded. – Dmitry Ginzburg May 13 '14 at 14:33
  • it appears they are using `.hashCode()` a the `KEY`, and expecting it to be unique which isn't going to work –  May 13 '14 at 14:34
  • @JarrodRoberson I'm fairly certain that SHA-1 hashing would be too expensive for OP to use too, given that speed appears to be a large priority – awksp May 13 '14 at 14:39
  • @JarrodRoberson "For my state representation I chose to use a hashtable with custom objects as keys" – awksp May 13 '14 at 14:41
  • 1
    @JarrodRoberson I accepted your answer because of how informative it is, but in the the answer of Victor-Philipp Negoescu and Peter Lawrey also helped me. – Floris Devriendt May 21 '14 at 22:22
6

For a unique representation of your object's state, you would need 19 bits in total. Thus, it is possible to represent it by a "perfect hash" integer value (which can have up to 32 bits):

@Override
public int hashCode() {
    int result = (stuck ? 1 : 0); // needs 1 bit (2 possible values)
    result += (facing ? 1 : 0) << 1; // needs 1 bit (2 possible values)
    result += marioMode << 2; // needs 2 bits (3 possible values)
    result += (onGround ? 1 : 0) << 4; // needs 1 bit (2 possible values)
    result += (canJump ? 1 : 0) << 5; // needs 1 bit (2 possible values)
    result += (wallNear ? 1 : 0) << 6; // needs 1 bit (2 possible values)
    result += (nearestEnemyX + 16) << 7; // needs 6 bits (33 possible values)
    result += (nearestEnemyY + 16) << 13; // needs 6 bits (33 possible values)
}
  • Might it be even better to calculate the `hashCode()` once and cache it because the object is immutable? – awksp May 13 '14 at 14:40
  • 1
    Of course, but that was not the question. :-) Ok, it was part of the question. But calculation of such a hash code is somewhat simple. You would need to store 2^19 integers (all possible hash codes) which would need at least 2^19 * 32 bit = 2 MB. – Victor-Philipp Negoescu May 13 '14 at 14:40
  • Yeah, true. Just thought it might help – awksp May 13 '14 at 14:42
  • 2
    @user3580294 to be honest the whole operation is going to take 10 ns or so on a recent cpu so it is unlikely that this is where the program is going to spend some time. – assylias May 13 '14 at 14:42
  • 1
    @assylias The math might be fast, true. Depends on how frequently the map is accessed. – awksp May 13 '14 at 14:46
  • @user3580294, there is no point to calc the hashCode separately actually (in this very case). The "optimization" results in larger objects and potentially extra cache line used (due to equals, the class have to read all the fields either way). In this case it's not needed. However if you need performance you'd not use separate fields to represent the state but just a single `int` and use it both for hashCode/equals. Either way - there should be no extra field for `hash`. – bestsss May 14 '14 at 14:41
  • @bestsss If you were to use a single `int` for all the state wouldn't that mean you'd have to do calculations every time you wanted to extract the value of a field though? Good point with the cache misses though... – awksp May 14 '14 at 19:49
  • @user3580294, yes the "calculations" are likely to take 1-2 extra CPU cycles (no branches, code is prefetched), compared to the possible cache misses (due to extra fields, even boolean is 4bytes long) it's profitable.... just less readable/modifiable. Basically all the data is guaranteed to be local if you have to use more than one field - in modern architecture cache misses are the ones that dominate performance issues as they are a lot more expensive than anything else (save for div/mod/CAS alike/etc) – bestsss May 14 '14 at 23:02
  • @bestsss Good point... Only way I could see that not mattering if it were one object per line but I doubt that that's the case (64 byte cache lines if memory serves, and wasted space would not be a good idea) – awksp May 15 '14 at 01:38
  • Your answer was helpful but I could only accept one. – Floris Devriendt May 21 '14 at 22:22
2

Instead of using 31 as a your magic number, you need to use the number of possibilities (normalised to 0)

@Override
public int hashCode() {
    int result = (stuck ? 1 : 0);                // 2 possible values: 0, 1
    result = 2 * result + (facing ? 1 : 0);      // 2 possible values: 0, 1 
    result = 3 * result + marioMode;             // 3 possible values: 0, 1, 2
    result = 2 * result + (onGround ? 1 : 0);    // 2 possible values: 0, 1 
    result = 2 * result + (canJump ? 1 : 0);     // 2 possible values: 0, 1 
    result = 2 * result + (wallNear ? 1 : 0);    // 2 possible values: 0, 1 
    result = 33 * result + (16 + nearestEnemyX); // 33 possible values: - 16 to 16
    result = 33 * result + (16 + nearestEnemyY); // 33 possible values: - 16 to 16

    return result;
}

This will give you 104544 possible hashCodes() BTW you can reverse this process to get the original values from the code by using a series of / and %

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • 1
    Thanks for the top on my nearestEnemyX & nearestEnemyY by adding the 16 to make the numbers positive. This helped me as well, just as the accepted answer. Unfortunately I can only accept one. – Floris Devriendt May 21 '14 at 22:24
-1

Try Guava's hashCode() method or JDK7's Objects.hash(). It's way better than writing your own. Don't repeat code yourself (and anyone else when you can use out of box solution):

Ray
  • 4,829
  • 4
  • 28
  • 55
  • `java.util` doesn't have a `hashCode()` method... And `Object#hashCode()` only works for reference equality, not object equality – awksp May 13 '14 at 14:31
  • 1
    @user3375746 What you propose won't solve the issue of getting negative integers. – assylias May 13 '14 at 14:43
  • 2
    @assylias I don't think that the negative integers are really the issue here (negative integers are valid hashCodes AFAIK) but the uniqueness which wouldn't be solved by this answer either. – Puce May 13 '14 at 15:07