Scala MurmurHash3 library not matching Spark Hash function

Both Scala and Spark use the same MurmurHash3 algorithm, but the results are different. Any idea why?

drlol
  • Looks like they have their own implementation of the algorithm: https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/hash/Murmur3_x86_32.java – Yuval Itzchakov Jul 27 '20 at 10:08
  • @YuvalItzchakov Is there any way I can use the same Spark hash in Scala to match a string? – drlol Jul 27 '20 at 12:54
  • The class itself is `public`. What we need to understand is whether it gets packaged in one of the Spark artifacts, such as `spark-core`. I'd try searching for it in the loaded jars, or at least try creating an instance. – Yuval Itzchakov Jul 27 '20 at 13:08

1 Answer

I found a way to compute, in Scala, the same hash for a string that Spark produces.

Spark ships its own Murmur3_x86_32 class (modelled on Guava's Murmur3 implementation, as the comment above points out), so we can call it directly with the settings Spark uses:

Seed value used in Spark = 42

String encoding = UTF-8

    import org.apache.spark.unsafe.types.UTF8String
    import org.apache.spark.unsafe.hash.Murmur3_x86_32._

    // Wrap the string as UTF-8 bytes, the representation Spark hashes.
    val s = UTF8String.fromString("Formatted String Goes Here")

    // Hash the raw bytes with Spark's seed, 42.
    hashUnsafeBytes(s.getBaseObject, s.getBaseOffset, s.numBytes(), 42)

This returns the same hash code as Spark's `hash` function gives for that string.
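As for why `scala.util.hashing.MurmurHash3.stringHash` does not match: it hashes the string's UTF-16 characters rather than its UTF-8 bytes, and it defaults to a different seed. If depending on Spark's unsafe classes is not an option, the steps can be sketched in plain Scala. This is only a sketch ported from my reading of the Murmur3_x86_32 source linked in the comments; the object and method names (`SparkLikeMurmur3`, `hashBytes`, `hashString`) are made up for illustration, and I have not checked the output against a running Spark, so verify it before relying on it.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical helper, not part of Spark or the Scala library.
// Assumption: it mirrors Murmur3_x86_32.hashUnsafeBytes as read from Spark's source.
object SparkLikeMurmur3 {
  private val C1 = 0xcc9e2d51
  private val C2 = 0x1b873593

  // Scramble one 32-bit input block.
  private def mixK1(k: Int): Int = Integer.rotateLeft(k * C1, 15) * C2

  // Fold a scrambled block into the running hash.
  private def mixH1(h: Int, k: Int): Int =
    Integer.rotateLeft(h ^ k, 13) * 5 + 0xe6546b64

  // Final avalanche step, which also mixes in the input length.
  private def fmix(h0: Int, length: Int): Int = {
    var h = h0 ^ length
    h ^= h >>> 16
    h *= 0x85ebca6b
    h ^= h >>> 13
    h *= 0xc2b2ae35
    h ^= h >>> 16
    h
  }

  def hashBytes(bytes: Array[Byte], seed: Int): Int = {
    val aligned = bytes.length - bytes.length % 4
    var h = seed
    var i = 0
    // Whole 4-byte blocks, read as little-endian ints.
    while (i < aligned) {
      val word = (bytes(i) & 0xff) | ((bytes(i + 1) & 0xff) << 8) |
        ((bytes(i + 2) & 0xff) << 16) | ((bytes(i + 3) & 0xff) << 24)
      h = mixH1(h, mixK1(word))
      i += 4
    }
    // Leftover bytes: each (sign-extended) byte is mixed as its own block.
    // This is where the sketch departs from standard Murmur3 tail handling.
    while (i < bytes.length) {
      h = mixH1(h, mixK1(bytes(i).toInt))
      i += 1
    }
    fmix(h, bytes.length)
  }

  // Seed 42 and UTF-8 encoding, as stated in the answer above.
  def hashString(s: String, seed: Int = 42): Int =
    hashBytes(s.getBytes(UTF_8), seed)
}
```

Usage would be `SparkLikeMurmur3.hashString("Formatted String Goes Here")`. Because of the tail handling, this should agree with `MurmurHash3.bytesHash(bytes, 42)` only when the UTF-8 length is a multiple of 4, so `bytesHash` alone is not a drop-in replacement either.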

drlol