
I am setting up a hash function that takes the MD5 of an object and tacks on the first four bytes of the object to prevent collisions. These objects can be quite large so I'd prefer to avoid serializing the entire object. What is the most space/time efficient way I can do this?

I've been looking at ObjectOutputStream and while it appears that there is a partial write function, it seems to require that I've already converted the object into a byte array.

Daniel Imberman
  • What is the purpose of the serialization? Normally you'd like to be able to deserialize the byte stream to get the original object. Did you consider using `transient` fields? You can mark any field in your class as transient if you don't want to persist it. – Ivan Koblik Jul 11 '14 at 21:36
  • What do you mean by "MD5 of an object"? What do you mean by "prevent collisions"? I don't think you're expressing your question very well. I also think ObjectOutputStream is already doing all of that, but you could be asking for something else, I'm not really sure. – markspace Jul 11 '14 at 21:38
  • `writeObject(Object obj)` is the function you are looking for; in it you can write the logic to serialize the object partially. – Vipin Jul 11 '14 at 21:40
  • So essentially my goal is to create the most space efficient data structure possible. Part of how I've been doing it so far is that I've been taking whatever object is being sent in as a key and just using the hashCode() function to store it in an int array. My boss is concerned about this allowing collisions (Which it probably would) so now I'm back to the drawing board. My initial thought was that since hashing algs (like md5) are unable to have collisions in similar values, that the first 4 bytes are guaranteed to be different for objects with the same hash, which would prevent collisions. – Daniel Imberman Jul 11 '14 at 21:48
  • But tacking on the 'first four bytes of the object' *won't* prevent collisions. They aren't unique. MD5 is already almost certainly strong enough. You won't get the first four bytes of the object via serialization easily, as there is a stream header, serialization protocol tags, etc. to be navigated first. 'Prevent collisions' and 'space efficient' are contrary goals. You don't need this. – user207421 Jul 11 '14 at 21:57
  • @EJP Thank you. So if I were trying to find a way to prevent collisions while keeping space and time constraints in mind (I'm on a big data team, so small inefficiencies grow very quickly), would you recommend that I just use MD5, or is there another algorithm/method I should consider? – Daniel Imberman Jul 11 '14 at 22:03
  • I've answered that. 'MD5 is almost certainly strong enough'. – user207421 Jul 11 '14 at 22:04
  • ok thank you, I was just making sure there wasn't a faster/more space efficient solution. – Daniel Imberman Jul 11 '14 at 22:08
  • I think the best way to 'get the best thing possible' would be to use a library rather than roll your own. Here's a [discussion on SO of Java serialization.](http://stackoverflow.com/questions/239280/which-is-the-best-alternative-for-java-serialization) (I like Kryo.) Then [GZIP](http://docs.oracle.com/javase/7/docs/api/java/util/zip/GZIPOutputStream.html) the result. – markspace Jul 11 '14 at 22:12

1 Answer


> I am setting up a hash function that takes the MD5 of an object and tacks on the first four bytes of the object to prevent collisions.

But tacking on the 'first four bytes of the object' won't prevent collisions. They aren't unique. MD5 is already almost certainly strong enough.
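To put rough numbers on "strong enough", here is a minimal sketch of the standard birthday-bound approximation. It assumes hash outputs are uniformly distributed and that keys are not chosen adversarially (MD5 is not collision-resistant against deliberate attacks). The class name `CollisionEstimate` and the key count of one billion are hypothetical, picked only for illustration.

```java
public class CollisionEstimate {

    // Birthday-bound approximation for the chance of at least one collision:
    // p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)), assuming uniformly random hashes.
    static double collisionProbability(double keys, int hashBits) {
        double space = Math.pow(2.0, hashBits);
        double exponent = keys * (keys - 1.0) / (2.0 * space);
        // -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
        return -Math.expm1(-exponent);
    }

    public static void main(String[] args) {
        double keys = 1e9; // hypothetical number of distinct keys
        System.out.printf("32-bit hashCode(): %.3g%n", collisionProbability(keys, 32));
        System.out.printf("128-bit MD5:       %.3g%n", collisionProbability(keys, 128));
    }
}
```

Under these assumptions a 32-bit `hashCode()` is all but certain to collide at that scale, while the estimate for a 128-bit digest is vanishingly small, so appending four more bytes adds little.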

> These objects can be quite large so I'd prefer to avoid serializing the entire object. What is the most space/time efficient way I can do this?
>
> I've been looking at ObjectOutputStream and while it appears that there is a partial write function, it seems to require that I've already converted the object into a byte array.

You won't get the first four bytes of the object via serialization easily, as there is a stream header, serialization protocol tags, etc. to be navigated first. Not that there is any point.
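The stream header point can be seen directly. The sketch below serializes two unrelated objects with `ObjectOutputStream` and prints the first four bytes of each stream; both start with the serialization magic number `0xACED` followed by the stream version `0x0005`. The class and method names are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class StreamHeaderDemo {

    // Serializes obj and returns the first four bytes of the resulting stream.
    static byte[] firstFourSerializedBytes(Object obj) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(obj);
            }
            return Arrays.copyOf(buffer.toByteArray(), 4);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Formats bytes as lowercase hex, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Both lines print aced0005: the header, not the object's own data.
        System.out.println(toHex(firstFourSerializedBytes("alpha")));
        System.out.println(toHex(firstFourSerializedBytes(Integer.valueOf(42))));
    }
}
```

Since every stream begins with the same four bytes, they carry no information that could tell two keys apart.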

Re your comments, 'prevent collisions' and 'space efficient' are contrary goals.

You don't need this.
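If an MD5 of the serialized form is still wanted, it can be computed without first building a byte array. The following is a sketch only, assuming the keys implement `Serializable` and serialize deterministically (equal keys must produce identical bytes, which is not guaranteed for every class). It wraps a discarding sink in `java.security.DigestOutputStream`, so the digest is updated as `ObjectOutputStream` writes and only a small internal buffer is held in memory. `StreamingMd5` and `md5Of` are hypothetical names.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class StreamingMd5 {

    // Sink that throws every byte away; only the digest sees the data.
    private static final OutputStream DISCARD = new OutputStream() {
        @Override public void write(int b) { }
        @Override public void write(byte[] b, int off, int len) { }
    };

    // Returns the 16-byte MD5 of the key's serialized form without
    // materializing that form as a byte array.
    static byte[] md5Of(Serializable key) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new DigestOutputStream(DISCARD, md))) {
                out.writeObject(key);
            } // closing flushes any buffered bytes through the digest
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] digest = md5Of("example key");
        System.out.println(digest.length + " bytes");
    }
}
```

The whole object graph is still traversed, so this saves memory rather than time; marking fields `transient`, as suggested in the comments, is one way to shrink what gets hashed.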

user207421