Zero-Allocation-Hashing murmur3: hashChars() and hashBytes() produce different output

Question

I am not sure whether I am using the murmur3 function (from OpenHFT's zero-allocation-hashing) correctly, but hashChars() and hashBytes() give different results:

// Using zero-allocation-hashing 0.16  
String input = "abc123";
System.out.println(LongHashFunction.murmur_3().hashChars(input));
System.out.println(LongHashFunction.murmur_3().hashBytes(input.getBytes(StandardCharsets.UTF_8)));

Output:

-4878457159164508227
-7432123028918728600

The latter produces the same output as the Guava library.

Which function should be used for String inputs?

Shouldn't both functions produce the same result?

Update:

How can I get the same output as:

Hashing.murmur3_128().newHasher().putString(input, Charsets.UTF_8).hash().asLong();
Hashing.murmur3_128().newHasher().putString(input, Charsets.UTF_8).hash().toString()

using the zero-allocation-hashing library, which seems to be faster than Guava?

You could implement your own `Access`, but an efficient implementation is only possible with ASCII input, as this API assumes random access to the resulting byte sequence (and creating a UTF-8 representation beforehand would contradict the “zero-allocation” goal). — Holger, Jun 12 '23 at 14:28

score 2 · Answer 1 · answered Jun 12 '23 at 09:43

A char and a byte have different sizes in Java:

a char is 16 bits (one UTF-16 code unit)
a byte is 8 bits

This difference matters even for a simple character like 'A'. In Unicode it is U+0041, so as a Java char it has the value 0x0041. In our example:

String input = "A";
byte[] bytes = input.getBytes(StandardCharsets.UTF_8);

System.out.println(LongHashFunction.murmur_3().hashChars(input));
System.out.println(LongHashFunction.murmur_3().hashBytes(bytes));

hashChars() works on the two bytes of the 16-bit char (0x0041), while hashBytes() on the UTF-8 bytes works on a single byte (0x41). This is why you get different results.
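To make the byte sequences visible, here is a minimal sketch that uses only the JDK. The class name `CharVsByteDemo` and the helper `hex` are made up for illustration and are not part of either library. It only prints how "A" is encoded under different charsets; it does not call the hashing library itself.

```java
import java.nio.charset.StandardCharsets;

class CharVsByteDemo {
    // Formats a byte array as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "A";
        // UTF-8 encodes 'A' as a single byte.
        System.out.println(hex(input.getBytes(StandardCharsets.UTF_8)));    // 41
        // UTF-16 uses two bytes per char; their order depends on the variant.
        System.out.println(hex(input.getBytes(StandardCharsets.UTF_16LE))); // 41 00
        System.out.println(hex(input.getBytes(StandardCharsets.UTF_16BE))); // 00 41
    }
}
```

A hash function that is fed one byte and a hash function that is fed two bytes see different inputs, so there is no reason to expect equal hashes.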

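Which function to use depends on your requirements. If you are hashing strings and the hashes only need to be consistent within your own application, hashChars() works directly on the chars. If the hash must correspond to a specific byte encoding (for example, to agree with another library), encode the string and use hashBytes().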

I want same output as `Hashing.murmur3_128().newHasher().putString(input, Charsets.UTF_8).hash().asLong()` and `Hashing.murmur3_128().newHasher().putString(input, Charsets.UTF_8).hash().toString()` using `zero-allocation-hashing` lib which seems to be faster than `Guava` — Nishant Kumar, Jun 12 '23 at 09:56

xerx593 · Accepted Answer · 2023-06-12T11:33:43.280

1

Your assumption does not hold for UTF-8; it holds for StandardCharsets.UTF_16LE:

String input = "abc123";

System.out.println(LongHashFunction.murmur_3().hashChars(input));
System.out.println(LongHashFunction.murmur_3().hashBytes(
  input.getBytes(StandardCharsets.UTF_16LE)
));

gives:

-4878457159164508227
-4878457159164508227

Additional Answer

For the desired:

Hashing.murmur3_128().newHasher().putString(input, Charsets.UTF_8).hash().asLong();

this:

LongHashFunction.murmur_3().hashBytes(input.getBytes(StandardCharsets.UTF_8));

seems to work (please test more!)

The (hex) string conversion is more of a problem: the Guava hash really produces 128 bits (16 bytes, 2 longs), whereas LongHashFunction gives us only 64 bits (one long).

~~Half of the digits i can reproduce with: ...~~

thx to:

With your help (sorry, this is my first encounter with this lib), I could finally get there:

System.out.println("Actual:   " +
    toHexString(
        LongTupleHashFunction.murmur_3().hashBytes(
            input.getBytes(StandardCharsets.UTF_8)
        )
    )
);

where:

// Renders each long as 16 hex digits, least significant byte first.
private static String toHexString(long[] hashLongs) {
    StringBuilder sb = new StringBuilder(hashLongs.length * Long.BYTES * 2);
    for (long lng : hashLongs) {
        for (int i = 0; i < Long.BYTES; i++) {
            // Shift by whole bytes (Byte.SIZE = 8 bits) to select byte i.
            byte b = (byte) (lng >> (i * Byte.SIZE));
            sb.append(HEX_DIGITS[(b >> 4) & 0xf]).append(HEX_DIGITS[b & 0xf]);
        }
    }
    return sb.toString();
}

private static final char[] HEX_DIGITS = "0123456789abcdef".toCharArray();
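As a cross-check for the byte order, the same conversion can be sketched with a little-endian `ByteBuffer` from the JDK. The class name `LittleEndianHex` and the method `toHex` are hypothetical names for this illustration. Note that this variant allocates a buffer, so it is meant for verifying output rather than for the hot path.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class LittleEndianHex {
    // Writes each long in little-endian byte order, then hex-encodes the bytes in sequence.
    static String toHex(long[] longs) {
        ByteBuffer buf = ByteBuffer.allocate(longs.length * Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (long l : longs) {
            buf.putLong(l);
        }
        StringBuilder sb = new StringBuilder(buf.capacity() * 2);
        for (byte b : buf.array()) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The low byte of each long comes first in the output.
        System.out.println(toHex(new long[] {0x0102030405060708L, 0xffL}));
        // prints 0807060504030201ff00000000000000
    }
}
```

If both conversions give the same string for the same long[], the manual shifting loop is picking the bytes in the intended order.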

edited Jun 12 '23 at 11:33

answered Jun 12 '23 at 09:49

xerx593


I want same output as `Hashing.murmur3_128().newHasher().putString(input, Charsets.UTF_8).hash().asLong()` and `Hashing.murmur3_128().newHasher().putString(input, Charsets.UTF_8).hash().toString()` using `zero-allocation-hashing` lib which seems to be faster than `Guava` – Nishant Kumar Jun 12 '23 at 09:57
..i thought it was "additional question" ;p – xerx593 Jun 12 '23 at 09:59

lol. you got me :p – Nishant Kumar Jun 12 '23 at 10:01
see updates ... – xerx593 Jun 12 '23 at 10:07
thanks. any idea about the 2nd part i.e. `toString()` equivalent which produce 128-bit hex string I think? – Nishant Kumar Jun 12 '23 at 10:21
also working on this (but it can also ruin the "seems to be faster"!?;) – xerx593 Jun 12 '23 at 10:36
I would love to see (and learn) the benchmark code and stats. – Nishant Kumar Jun 12 '23 at 10:43
"I would love to see (and learn) the benchmark code and stats." > me2 ! lol – xerx593 Jun 12 '23 at 10:56

`LongTupleHashFunction.xx128().hashBytes(s.getBytes(StandardCharsets.UTF_8))` can be used to get long[]. I have also used the same function reference for hex conversion but something seems wrong. – Nishant Kumar Jun 12 '23 at 11:03
