
I have a simple question about the PySpark hash function.

I have checked that in Scala, Spark uses Murmur3Hash, based on "Hash function in Spark".

I want to know which algorithm is actually used by the hash function in PySpark (https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html#hash).

Could anyone answer this question? I would also like to see the code that shows which algorithm the PySpark hash function uses.


2 Answers


Please note that reproducing the hash values outside PySpark is not trivial, at least in Python. PySpark uses its own implementation of this algorithm, which doesn't give the same results as the Murmur3 libraries available in Python.

Even Scala's and PySpark's hash algorithms aren't directly compatible. The reason for this is explained in https://stackoverflow.com/a/46472986/10999642
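
To make the mismatch concrete, here is a minimal sketch assuming the third-party mmh3 package (a reference Murmur3 implementation, pip install mmh3). Spark seeds its Murmur3 with 42, but the two implementations generally won't agree on the same input:

import mmh3

# Reference Murmur3 (x86, 32-bit) over the UTF-8 bytes, seeded with 42,
# the fixed seed Spark uses for hash()
print(mmh3.hash("Spark".encode("utf-8"), 42))

# Compare against what Spark computes for the same input:
#   spark.sql("SELECT hash('Spark')").show()
# In general the two values will not match, for the reason linked above.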

So if reproducibility in Python is important, you can use Python's built-in hash function, like so:

from pyspark.sql import functions as F, types as T

# Python's built-in hash() returns a 64-bit int on most platforms, hence LongType
udf_hash = F.udf(lambda val: hash(val), T.LongType())
df = df.withColumn("hash", udf_hash("<column name>"))
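
One caveat: since Python 3.3, the built-in hash() salts str values per process, so to make the results reproducible across runs (and across executors) you need to set the PYTHONHASHSEED environment variable to a fixed value on the driver and all workers.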

PySpark is just a wrapper around the Scala Spark code, so I believe it uses the same hash function as Scala Spark.

In your link to the source code, you can see that it calls sc._jvm.functions.hash, which essentially points to the equivalent function in the Scala source code (inside the JVM).
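
To make the dispatch concrete, here is a simplified sketch of what pyspark.sql.functions.hash does, paraphrased from the linked source (not the verbatim implementation):

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column, _to_seq

def hash(*cols):
    # Forward the call through the py4j gateway to the JVM-side
    # org.apache.spark.sql.functions.hash and wrap the result as a Column
    sc = SparkContext._active_spark_context
    jc = sc._jvm.functions.hash(_to_seq(sc, cols, _to_java_column))
    return Column(jc)

So the Python side does no hashing itself; the algorithm is whatever the Scala function implements.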

  • Thanks a lot for your answer. BTW, how can I confirm that 'sc._jvm.functions.hash' is using the murmur3 algorithm? – CI L'OC Apr 11 '21 at 07:45
  • Here is the source for Scala Spark: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2464. `_jvm` points to the Scala function in a somewhat complex manner, because it needs to go from Python, through the py4j Java gateway into the JVM, and then access the Scala/Java functions. – mck Apr 11 '21 at 07:50
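
Following up on the comment thread: one quick, non-authoritative way to check from the Python side is to inspect the query plan, which prints the hash expression together with its seed (Spark's Murmur3Hash uses a fixed seed of 42). A minimal sketch, assuming a running SparkSession named spark:

# The printed plan should show the murmur3-based hash expression with
# Spark's fixed seed, e.g. hash(Spark, 42)
spark.sql("SELECT hash('Spark')").explain()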