Scala MurmurHash3 library not matching Spark hash function

Both Scala and Spark use the same MurmurHash3 algorithm, but the results are different for the same input. Any idea?
-
Looks like they have their own implementation of the algorithm: https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/hash/Murmur3_x86_32.java – Yuval Itzchakov Jul 27 '20 at 10:08
-
@YuvalItzchakov Is there any way I can use the same Spark hash in Scala to match a string? – drlol Jul 27 '20 at 12:54
-
The class itself is `public`. What we need to understand is whether it gets packaged as part of one of the Spark packages, such as `spark-core`. I'd try searching for it in the loaded jars, or at least trying to create an instance. – Yuval Itzchakov Jul 27 '20 at 13:08
1 Answer
I found a way to hash a string in Scala that gives the same result as the Spark hash function.
Spark ships its own Murmur3_x86_32 implementation (based on Guava's), so we can simply call it directly as below, with the same parameters Spark uses:
Seed value used in Spark: 42
String encoding: UTF-8
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.unsafe.hash.Murmur3_x86_32._
// Encode the string as UTF-8 in Spark's own string type
val s = UTF8String.fromString("Formatted String Goes Here")
// Hash the underlying bytes with seed 42
hashUnsafeBytes(s.getBaseObject, s.getBaseOffset, s.numBytes(), 42)
This returns the same hash code as the Spark hash function.
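For repeated use, the call can be wrapped in a small helper. This is only a sketch: `SparkHashCompat`, `sparkHash` and `DefaultSeed` are hypothetical names that are not part of Spark, and it assumes Spark's `org.apache.spark.unsafe` classes are on the classpath and that the input is a single non-null string.

```scala
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical helper (not part of Spark): hashes one non-null string
// with Spark's own Murmur3_x86_32 class.
object SparkHashCompat {
  // Seed value quoted in this answer.
  val DefaultSeed: Int = 42

  def sparkHash(value: String, seed: Int = DefaultSeed): Int = {
    // Convert to Spark's UTF-8 backed string type.
    val utf8 = UTF8String.fromString(value)
    // Hash the raw UTF-8 bytes starting at the string's base offset.
    Murmur3_x86_32.hashUnsafeBytes(
      utf8.getBaseObject, utf8.getBaseOffset, utf8.numBytes(), seed)
  }
}

object SparkHashCompatDemo {
  def main(args: Array[String]): Unit = {
    // Prints the hash of the sample string using the default seed.
    println(SparkHashCompat.sparkHash("Formatted String Goes Here"))
  }
}
```

The printed value can then be compared with what Spark returns for a string column holding the same text. Null values and multi-column input are not covered by this sketch.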

drlol