I am looking for a portable algorithm for creating a hashCode for binary data. None of the binary data is very long -- I am Avro
-encoding keys for use in kafka.KeyedMessages
-- we're probably talking anywhere from 2 to 100 bytes in length, but most of the keys are in the 4 to 8 byte range.
So far, my best solution is to convert the data to a hex string, and then do a hashCode
of that. I'm able to make that work in both Scala and JavaScript. Assuming I have defined b: Array[Byte]
, the Scala looks like this:
b.map("%02X" format _).mkString.hashCode
It's a little more elaborate in JavaScript
-- luckily someone already ported the basic hashCode algorithm to JavaScript -- but the point is being able to create a Hex
string to represent the binary data, I can ensure the hashing algorithm works off the same inputs.
On the other hand, I have to create an object twice the size of the original just to create the hashCode. Luckily most of my data is tiny, but still -- there has to be a better way to do this.
Instead of padding the data as its hex value, I presume you could just coerce the binary data into a String so the String has the same number of bytes as the binary data. It would be all garbled, more control characters than printable characters, but it would be a string nonetheless. Do you run into portability issues though? Endian-ness, Unicode, etc.
Incidentally, if you got this far reading and don't already know this -- you can't just do:
val b: Array[Byte] = ...
b.hashCode
Luckily I already knew that before I started, because I ran into that one early on.
Update
Based on the first answer given, it appears at first blush that java.util.Arrays.hashCode(Array[Byte])
would do the trick. However, if you follow the javadoc trail, you'll see that this is the algorithm behind it, which is as based on the algorithm for List and the algorithm for byte
combined.
int hashCode = 1;
for (byte e : list) hashCode = 31*hashCode + (e==null ? 0 : e.intValue());
As you can see, all it's doing is creating a Long
representing the value. At a certain point, the number gets too big and it wraps around. This is not very portable. I can get it to work for JavaScript, but you have to import the npm
module long
. If you do, it looks like this:
function bufferHashCode(buffer) {
const Long = require('long');
var hashCode = new Long(1);
for (var value of buff.values()) { hashCode = hashCode.multiply(31).add(value) }
return hashCode
}
bufferHashCode(new Buffer([1,2,3]));
// hashCode = Long { low: 30817, high: 0, unsigned: false }
And you do get the same results when the data wraps around, sort of, though I'm not sure why. In Scala:
java.util.Arrays.hashCode(Array[Byte](1,2,3,4,5,6,7,8,9,10))
// res30: Int = -975991962
Note that the result is an Int. In JavaScript:
bufferHashCode(new Buffer([1,2,3,4,5,6,7,8,9,10]);
// hashCode = Long { low: -975991962, high: 197407, unsigned: false }
So I have to take the low
bytes and ignore the high
, but otherwise I get the same results.