0

I am currently in need to serialize arbitrary Java objects since I would like to use the Hash as a key for a hash table. After I read various warnings that the default hashCode creates collisions way to often, I wanted to switch to hashing via MessageDigest to use alternative algorithms (e.g. SHA1, ...) that are said to allow more entries without collisions. [As a sidenote: I am aware that even here collisions can occur early on, yet I want to increase the likelihood to remain collision free.]

To achieve this I tried a method proposed in this StackOverflow post. It uses the following code to obtain a byte[] necessary for MessageDigest:

public static byte[] convertToHashableByteArray(Object obj) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    ObjectOutput out = null;
    byte[] byteOutput = null;

    try {
        out = new ObjectOutputStream(bos);
        out.writeObject(obj);
        byteOutput = bos.toByteArray();
    } catch (IOException io) {
        io.printStackTrace();
    } finally {
        try {
            if(out != null) { out.close(); }
        } catch(IOException io) {
            io.printStackTrace();
        }

        try {
            bos.close();
        } catch(IOException io) {
            io.printStackTrace();
        }
    }

    return byteOutput;
}

This, however, causes the problem that only objects implementing the serializable interface will be serialized/converted into a byte[]. To circumvent this issue I applied toString() to the given obj in the catch clause to enforce getting a byte[] in all cases:

public static byte[] convertToHashableByteArray(Object obj) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    ObjectOutput out = null;
    byte[] byteOutput = null;

    try {
        out = new ObjectOutputStream(bos);
        out.writeObject(obj);
        byteOutput = bos.toByteArray();
    } catch (IOException io) {
        String stringed = obj.toString();
        byteOutput = stringed.getBytes();
    } finally {
        try {
            if(out != null) { out.close(); }
        } catch(IOException io) {
            io.printStackTrace();
        }

        try {
            bos.close();
        } catch(IOException io) {
            io.printStackTrace();
        }
    }

    return byteOutput;
}

However, this still feels utterly wrong for me. So my question is, whether there is a better alternative to convert arbitrary objects to byte[] to be able to compute hashes. Preferably a solution that works without using additional libraries or one using well established ones like Apache Commons. (Beside that I am also open for other approaches to obtain SHA1/SHA512 hashes of arbitrary Java objects.)

Community
  • 1
  • 1
Marco N.
  • 185
  • 2
  • 8
  • Why you need to serialize an object if you need to take its Hash (hashCode?!?) ? – Davide Lorenzo MARINO Aug 31 '16 at 09:21
  • As far as I know, you can't serialize an object that doesn't implement `serializable` – Cas Aug 31 '16 at 09:22
  • "the default hashcode generates collisions way too often" - where is that said? – Oliver Charlesworth Aug 31 '16 at 09:22
  • @OliverCharlesworth: I read that here: [Link](http://eclipsesource.com/blogs/2012/09/04/the-3-things-you-should-know-about-hashcode/) And at least for me the reasoning sounds pretty reasonable. – Marco N. Aug 31 '16 at 09:24
  • 1
    Ok I misunderstood the intent here - "using the hash as a key" is almost certainly wrong and going to lead to sad times; even an ideal hash function is subject to the birthday paradox (which is what that article is referring to). Why are you not using the *object* as the key? – Oliver Charlesworth Aug 31 '16 at 09:28
  • Well, yes I am pretty much aware of the issues the birthday paradox can cause. Since I only have a couple 100 to a couple 1000 objects to serialize I was just hoping to create the max. level of collision resistance possible. I actually did not really consider using objects. Actually I am a bit afraid of the potential effects regarding memory usage. Which is another reason I wanted to use hashes. – Marco N. Aug 31 '16 at 09:31
  • Lets put it that way: do you see performance issues? Are writing code for mobile or embedded environments? In other words: what makes you think that dealing with a few 1000 objects will "kill" you unless you focus on such super-subtle details? Or the other way round - you heard that saying about premature optimization being the root of all evil? – GhostCat Aug 31 '16 at 09:32
  • Ok, but (a) there is nothing to suggest that the default hashcode implementation of `Object` is deficient here, and (b) you seem to want to trade micro-optimisation for correctness? – Oliver Charlesworth Aug 31 '16 at 09:33
  • Well, yes I guess I heard that saying. I am just interested in improving my skills, since not thinking about optimization will lead to making mistakes/questionable decisions over and over again. But yes, apparently I will have to stick to `hashCode()` for now, and I will also rethink @OliverCharlesworth's solution of using the objects themselves (I guess the memory argument is valid but I have to admit that it might be way too much micro optimization [since the app is neither for mobile nor embedded systems]). – Marco N. Aug 31 '16 at 09:38
  • 32-bit hash codes as returned by the Java hashCode() leads to chances of duplicates. With 64-bit hash codes, you could store 2 to the power of 32 - 7 or 30 million and above items in our hash table before there is a 0.0031% chance of a collision. – Gopidoss Aug 31 '16 at 10:01

3 Answers3

0

Perhaps you can use UUIDs for your objects as immutable unique identifiers?

San Droid
  • 332
  • 1
  • 7
0

There are so many things wrong here...

  1. You should have proper key classes with equals and hashCode implemented, instead of using random objects.
  2. Serialization performance overhead can easily mean that such map will be slower than even trivial iteration search.
  3. Default hashcode should not be used in most cases, as it might be different for objects which are 'equal' from business point of view. You should reimplement hashcode together with equals (which comes back to point 1). Whenever it has collisions due to pointer aliasing is irrelevant if it won't work poperly
  4. Way overcomplicated method of closing in-memory streams. Just close them one after another, it is not external resource - if it fails, just let it fail, you don't need to close everything 100% in case of failures. You can also use one of closeable utilities (or try/catch with resources) to avoid some overhead
  5. You don't need complicated digest of that byte array - use Arrays.hashCode, it WILL be good enough for your use cases (remember - don't do it anyway, point 1)

If you are still reading and still not willing to implement point 1, go back to point 1. And again. And again.

And to finally answer you question, use hessian serialization.

http://hessian.caucho.com/doc/hessian-overview.xtp

It is very similar to java one, just faster, shorter output and allows serializing objects which do not implement Serializable interface (at the risk of messing things up, you need to set special flag to allow that).

Artur Biesiadowski
  • 3,595
  • 2
  • 13
  • 18
-1

If you want to serialize a given object, i suggest you change your méthod like this :

public static byte[] convertToHashableByteArray(Serializable obj){
     ..........
     ..........
}
florex
  • 943
  • 7
  • 9
  • 1
    `Objects.hashCode` basically just calls `hashCode` on the object. – Oliver Charlesworth Aug 31 '16 at 09:35
  • Well, the `Objects.hashCode(o)` method is only a null-safe version of `Object.hashCode`... if you talk about the `java.util.Objects` class. – René Link Aug 31 '16 at 09:36
  • The second part with `(Serializable obj)` would prevent any Exceptions regarding non-serializable objects that is true. However, it would still not solve the problem of converting non-serializable objects to `byte[]`. – Marco N. Aug 31 '16 at 09:44
  • the class Hashtable implements Serializable. so a hashtable is a serializable object (http://www.tutorialspoint.com/java/util/java_util_hashtable.htm). – florex Aug 31 '16 at 09:52
  • As René Link said, java.util.Objects.hashCode() is a null-safe version of Object.hashCode(). – florex Aug 31 '16 at 09:57