We have two libraries, one in Python and one in Go, that need to compute identical murmur3 hashes. Unfortunately, no matter what we try, we cannot get them to produce the same result. Judging from this SO question about Java and Python, compatibility isn't necessarily straightforward.

Right now we're using the Python mmh3 library and the Go github.com/spaolacci/murmur3 library.

In Go:

hash := murmur3.New128()
hash.Write([]byte("chocolate-covered-espresso-beans"))
fmt.Println(base64.RawURLEncoding.EncodeToString(hash.Sum(nil)))
// Output: cLHSo2nCBxyOezviLM5gwg

In Python:

import base64
import mmh3

name = "chocolate-covered-espresso-beans"
hash = mmh3.hash128(name.encode('utf-8'), signed=False).to_bytes(16, byteorder='big', signed=False)
print(base64.urlsafe_b64encode(hash).decode('utf-8').strip("="))
# Output: jns74izOYMJwsdKjacIHHA (big byteorder)

hash = mmh3.hash128(name.encode('utf-8'), signed=False).to_bytes(16, byteorder='little', signed=False)
print(base64.urlsafe_b64encode(hash).decode('utf-8').strip("="))
# Output: HAfCaaPSsXDCYM4s4jt7jg (little byteorder)

hash = mmh3.hash_bytes(name.encode('utf-8'))
print(base64.urlsafe_b64encode(hash).decode('utf-8').strip("="))
# Output: HAfCaaPSsXDCYM4s4jt7jg

In Go, murmur3 returns uint64 values, so we assume signed=False in Python; however, we also tried signed=True and did not get matching hashes.

We're open to different libraries, but we're wondering whether there is something wrong with how either our Go or our Python code computes a base64-encoded hash from a string. Any help appreciated.

bbengfort
  • Here is a gist for various testing we're doing if it helps: https://gist.github.com/bbengfort/fed9f92142b31b0261fc71fdb0d168a5 – bbengfort Apr 03 '23 at 16:14
  • There are two different versions of the murmurhash3 algorithm, one optimized for x86 and one optimized for x64, that produce different hash values. It's possible you're not using the same algorithm on both platforms. – user2357112 Apr 03 '23 at 16:23
  • Good thought; I assumed that the x64 optimizations would be used on my machine but perhaps I should check that out in more detail. – bbengfort Apr 03 '23 at 17:09
  • Don't they both produce ints? Why don't you look at those, instead of adding multiple steps to turn them into strings? Maybe the difference is actually in those additional steps... – Kelly Bundy Apr 03 '23 at 17:26
  • Your Go code isn't valid. – Kelly Bundy Apr 03 '23 at 17:28
  • Ah, sorry missed closing parentheses - it is fixed now. We did check to see if the encoding was the problem, using long decimal, hex, and base64 encodings of the 128-bit number that was returned and we were still having the problem with non-matching hashes. – bbengfort Apr 03 '23 at 20:55

1 Answer

That first Python result is almost right.

>>> binascii.hexlify(base64.b64decode('jns74izOYMJwsdKjacIHHA=='))
b'8e7b3be22cce60c270b1d2a369c2071c'

In Go:

    x, y := murmur3.Sum128([]byte("chocolate-covered-espresso-beans"))
    fmt.Printf("%x %x\n", x, y)

Results in:

70b1d2a369c2071c 8e7b3be22cce60c2

So the order of the two 64-bit words is flipped: Go prints the low word first, while the big-endian Python bytes start with the high word. To get the same result in Python, you can try something like:

import base64
import mmh3

name = "chocolate-covered-espresso-beans"
hash = mmh3.hash128(name.encode('utf-8'), signed=False).to_bytes(16, byteorder='big', signed=False)
hash = hash[8:] + hash[:8]
print(base64.urlsafe_b64encode(hash).decode('utf-8').strip("="))
# cLHSo2nCBxyOezviLM5gwg
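
The same swap can also be done on the integer itself, which may make the word order more explicit. This is only a minimal sketch and not part of either library: to_go_digest is a hypothetical helper name, and the hard-coded integer is simply the value from the hex dump above, standing in for what mmh3.hash128(..., signed=False) returned for this input.

```python
import base64

MASK64 = (1 << 64) - 1


def to_go_digest(value: int) -> bytes:
    """Reorder an unsigned 128-bit hash so the low 64-bit word comes first.

    Hypothetical helper: assumes `value` is an unsigned 128-bit integer such as
    the one mmh3.hash128(..., signed=False) returned here, and that the Go side
    emits the low word first, as observed in the output above.
    """
    low = value & MASK64   # 70b1d2a369c2071c in this example
    high = value >> 64     # 8e7b3be22cce60c2 in this example
    return low.to_bytes(8, "big") + high.to_bytes(8, "big")


# Value copied from the hex dump above; stands in for the mmh3.hash128 result.
value = 0x8E7B3BE22CCE60C270B1D2A369C2071C
encoded = base64.urlsafe_b64encode(to_go_digest(value)).decode("ascii").rstrip("=")
print(encoded)
# cLHSo2nCBxyOezviLM5gwg
```

Working on the integer avoids slicing by byte offsets, but it is equivalent to the hash[8:] + hash[:8] version for a 16-byte big-endian value.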
kichik
  • Good eye; not sure how you spotted that! I guess both Go and Python's murmur128 are using 2 64-bit unsigned ints but returning them in different orders. Seems like that could cause a few problems with the base specification. – bbengfort Apr 03 '23 at 17:20
  • It's become one of those "check first" things for me after dealing with it so many times in the past. Seeing the data in hex makes these kinds of issues easier to spot. – kichik Apr 03 '23 at 17:51
  • I've opened issues in both the Go and Python repositories to see if there is a way we can find a fix with respect to the murmur3 reference code. – bbengfort Apr 04 '23 at 22:57