0

I am looking for a portable algorithm for creating a hashCode for binary data. None of the binary data is very long -- I am Avro-encoding keys for use in kafka.KeyedMessages -- we're probably talking anywhere from 2 to 100 bytes in length, but most of the keys are in the 4 to 8 byte range.

So far, my best solution is to convert the data to a hex string, and then do a hashCode of that. I'm able to make that work in both Scala and JavaScript. Assuming I have defined b: Array[Byte], the Scala looks like this:

b.map("%02X" format _).mkString.hashCode

It's a little more elaborate in JavaScript -- luckily someone already ported the basic hashCode algorithm to JavaScript -- but the point is being able to create a Hex string to represent the binary data, I can ensure the hashing algorithm works off the same inputs.

On the other hand, I have to create an object twice the size of the original just to create the hashCode. Luckily most of my data is tiny, but still -- there has to be a better way to do this.

Instead of padding the data as its hex value, I presume you could just coerce the binary data into a String so the String has the same number of bytes as the binary data. It would be all garbled, more control characters than printable characters, but it would be a string nonetheless. Do you run into portability issues though? Endian-ness, Unicode, etc.

Incidentally, if you got this far reading and don't already know this -- you can't just do:

val b: Array[Byte] = ...
b.hashCode

Luckily I already knew that before I started, because I ran into that one early on.

Update

Based on the first answer given, it appears at first blush that java.util.Arrays.hashCode(Array[Byte]) would do the trick. However, if you follow the javadoc trail, you'll see that this is the algorithm behind it, which is as based on the algorithm for List and the algorithm for byte combined.

int hashCode = 1;
for (byte e : list)  hashCode = 31*hashCode + (e==null ? 0 : e.intValue());

As you can see, all it's doing is creating a Long representing the value. At a certain point, the number gets too big and it wraps around. This is not very portable. I can get it to work for JavaScript, but you have to import the npm module long. If you do, it looks like this:

function bufferHashCode(buffer) {
  const Long = require('long');
  var hashCode = new Long(1);
  for (var value of buff.values()) { hashCode = hashCode.multiply(31).add(value) }
  return hashCode
}

bufferHashCode(new Buffer([1,2,3]));
// hashCode = Long { low: 30817, high: 0, unsigned: false }

And you do get the same results when the data wraps around, sort of, though I'm not sure why. In Scala:

java.util.Arrays.hashCode(Array[Byte](1,2,3,4,5,6,7,8,9,10))
// res30: Int = -975991962

Note that the result is an Int. In JavaScript:

bufferHashCode(new Buffer([1,2,3,4,5,6,7,8,9,10]);
// hashCode = Long { low: -975991962, high: 197407, unsigned: false }

So I have to take the low bytes and ignore the high, but otherwise I get the same results.

Community
  • 1
  • 1
David Griffin
  • 13,677
  • 5
  • 47
  • 65
  • 1
    If it needs to be portable, then shouldn't you use something like md5? There must be APIs/implementations of md5 in all the languages you mentioned, right? – DavidS May 04 '16 at 22:06
  • Or more broadly [some fast non-cryptographic hash](http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed). – DavidS May 04 '16 at 22:14

2 Answers2

1

This functionality is already available in Java standard library, look at the Arrays.hashCode() method.

Because your binary data are Array[Byte], here is how you can verify it works:

println(java.util.Arrays.hashCode(Array[Byte](1,2,3))) // prints 30817
println(java.util.Arrays.hashCode(Array[Byte](1,2,3))) // prints 30817
println(java.util.Arrays.hashCode(Array[Byte](2,2,3))) // prints 31778

Update: It is not true that the Java implementation boxes the bytes. Of course, there is conversion to int, but there's no way around that. This is the Java implementation:

public static int hashCode(byte a[]) {
    if (a == null) return 0;
    int result = 1;
    for (byte element : a) result = 31 * result + element;
    return result;
}

Update 2 If what you need is a JavaScript implementation that gives the same results as a Scala/Java implementation, than you can extend the algorithm by, e.g., taking only the rightmost 31 bits:

def hashCode(a: Array[Byte]): Int = {
  if (a == null) {
    0
  } else {
    var hash = 1
    var i: Int = 0
    while (i < a.length) {
      hash = 31 * hash + a(i)
      hash = hash & Int.MaxValue // taking only the rightmost 31 bits
      i += 1
    }
    hash
  }
}

and JavaScript:

var hashCode = function(arr) {
    if (arr == null) return 0; 
    var hash = 1;
    for (var i = 0; i < arr.length; i++) {
        hash = hash * 31 + arr[i]
        hash = hash % 0x80000000 // taking only the rightmost 31 bits in integer representation
    }
    return hash;
}

Why do the two implementations produce the same results? In Java, integer overflow behaves as if the addition was performed without loss of precision and then bits higher than 32 got thrown away and & Int.MaxValue throws away the 32nd bit. In JavaScript, there is no loss of precision for integers up to 253 which is a limit the expression 31 * hash + a(i) never exceeds. % 0x80000000 then behaves as taking the rightmost 31 bits. The case without overflows is obvious.

Mifeet
  • 12,949
  • 5
  • 60
  • 108
  • Any idea what the algorithm is -- the question is portability too. How does it work, am I going to be able to recreate it in a non-Java based language? – David Griffin May 04 '16 at 16:33
  • See my update in the question, this algorithm isn't very portable. – David Griffin May 04 '16 at 17:17
  • @DavidGriffin So what you want is not a hashing algorithm in Scala/Java and Javascript, but an algorithm that *gives the same result* in both? If so, see my updated answer. – Mifeet May 07 '16 at 14:32
  • I assume no worse in terms of collisions since the the result is an `Int` either way? – David Griffin May 07 '16 at 14:35
  • Collision rate depends on the usage pattern. Hashtables often have size which are a power of 2. If you use simple modulo table size, the precision is the same as the `Arrays` version. If you use the same algorithm as Java `HashMap`, you may loose some precision for tables larger then 2^17. Generally, if you really need as little collisions as possible, it's better to use some other hash function, e.g., [Murmur hash](https://en.wikipedia.org/wiki/MurmurHash) or what DavidS suggested in the comments. Otherwise the function above is good enough. – Mifeet May 07 '16 at 14:58
1

This is the meat of algorithm used in the Java library:

  int result 1;
  for (byte element : a) result = 31 * result + element;

You comment:

this algorithm isn't very portable

Incorrect. If we are talking about Java, then provided that we all agree on the type of the result, then the algorithm is 100% portable.

Yes the computation overflows, but it overflows exactly the same way on all valid implementations of the Java language. A Java int is specified to be 32 bits signed two's complement, and the behavior of the operators when overflow occurs is well-defined ... and the same for all implementations. (The same goes for long ... though the size is different, obviously.)

I'm not an expert, but my understanding is that Scala's numeric types have the same properties as Java. Javascript is different, being based on IEE 754 double precision floating point. However, with case you should be able to code the Java algorithm portably in Javascript. (I think @Mifeet's version is wrong ...)

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
  • @DavidGriffin - What isn't correct? Mifeet's Javascript version, or my answer? – Stephen C May 05 '16 at 05:04
  • Sorry, was on a phone and it hit return before I meant -- Mifeet's JavaScript version is not correct. You have to use a library to implement proper `Long` handling. At least he library I used wraps around the same way as the Java one. – David Griffin May 05 '16 at 05:13
  • Alternatively, he could use bit-masking after each iteration. Apparently Javascript converts a number into a 32bit int before doing a bitwise operation. – Stephen C May 05 '16 at 07:36