Does pyspark hash guarantee unique result for different input?

Asked Aug 07 '19 at 03:46

Active Aug 07 '19 at 14:55

Viewed 465 times

I am using pyspark.sql.functions.hash on a given set of columns and expect to get a different output for different rows. I noticed that I am getting the same hash back even though the values in the input rows were different. Is this expected, or is it a bug?

df = df.withColumn("my_key", F.hash(*["some other columns"]))

This obviously doesn't happen all the time, so it is hard to reproduce.

edited Aug 07 '19 at 14:55

pault


asked Aug 07 '19 at 03:46

Eran Witkon


Try writing a UDF with the standard Python hash function. – gorros Aug 07 '19 at 08:58

[`pyspark.sql.functions.hash`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.hash) uses the [murmur3 hash function](https://stackoverflow.com/questions/53634650/hash-function-in-spark), which is *NOT* a cryptographic hash function. You should instead use [`pyspark.sql.functions.sha2`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sha2) which implements the SHA-2 family of cryptographic hash functions. [Is it safe to ignore the possibility of SHA collisions in practice?](https://stackoverflow.com/a/4014407/5858851) – pault Aug 07 '19 at 14:16
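
To put rough numbers on the comment above: `pyspark.sql.functions.hash` returns a 32-bit integer, so it can take at most 2^32 distinct values, and collisions between different rows are expected once the DataFrame is large enough. The sketch below uses the standard birthday-bound approximation to estimate how likely at least one collision is. It assumes the hash spreads inputs uniformly, which is only an approximation for murmur3, and `collision_probability` is a hypothetical helper written for illustration, not part of PySpark.

```python
import math


def collision_probability(n_rows: int, hash_bits: int) -> float:
    """Approximate P(at least one collision) for n_rows distinct inputs.

    Assumes outputs are uniformly distributed over 2**hash_bits values
    (birthday-bound approximation: 1 - exp(-n(n-1) / (2 * buckets))).
    """
    buckets = 2.0 ** hash_bits
    exponent = n_rows * (n_rows - 1) / (2.0 * buckets)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    # Compare a 32-bit hash (like F.hash) with a 256-bit digest (like sha2 with 256 bits).
    for n in (10_000, 100_000, 1_000_000):
        p32 = collision_probability(n, 32)
        p256 = collision_probability(n, 256)
        print(f"rows={n:>9,}  32-bit: {p32:.4f}  256-bit: {p256:.3e}")
```

Under these assumptions, a 32-bit hash is more likely than not to collide somewhere in a table of around 100,000 distinct rows, while a 256-bit digest keeps the probability negligible at any realistic row count. Neither gives a strict uniqueness guarantee; the wider digest only makes a collision extremely unlikely.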
