
I am using pyspark.sql.functions.hash on a given set of columns and expect to get a different output for different rows. I noticed that I am getting the same hash back even though the values in the input rows were different. Is this expected, or is it a bug?

df = df.withColumn("my_key", F.hash(["some other columns"]))

This obviously doesn't happen all the time, so it is hard to reproduce.

pault
Eran Witkon
  • Try to write UDF with standard Python hash function. – gorros Aug 07 '19 at 08:58
  • [`pyspark.sql.functions.hash`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.hash) uses the [murmur3 hash function](https://stackoverflow.com/questions/53634650/hash-function-in-spark), which is *NOT* a cryptographic hash function. You should instead use [`pyspark.sql.functions.sha2`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sha2) which implements the SHA-2 family of cryptographic hash functions. [Is it safe to ignore the possibility of SHA collisions in practice?](https://stackoverflow.com/a/4014407/5858851) – pault Aug 07 '19 at 14:16
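
To put rough numbers on the comment above: `pyspark.sql.functions.hash` returns a 32-bit integer, so occasional collisions between different rows are expected once the DataFrame is large enough. The sketch below estimates that probability with the standard birthday-bound approximation, and shows one possible way to build a key with `sha2` instead. The names `collision_probability` and `add_sha2_key`, the separator, and the null sentinel are hypothetical choices for illustration, not part of the original question, and the approach assumes the chosen columns can be cast to strings.

```python
import math


def collision_probability(n_rows: int, hash_bits: int = 32) -> float:
    """Approximate P(at least one collision) among n_rows hashes that are
    uniformly distributed over 2**hash_bits values (birthday bound)."""
    space = 2 ** hash_bits
    exponent = n_rows * (n_rows - 1) / (2.0 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)


def add_sha2_key(df, cols, key_name="my_key"):
    """Hypothetical helper: add a SHA-256 key built from the given columns.

    pyspark is imported lazily so the estimate above can be used without Spark.
    """
    from pyspark.sql import functions as F

    # Assumption: "\u0000" stands in for NULL and "\u0001" separates fields.
    # concat_ws skips NULLs, so without the coalesce the rows
    # ("a", NULL, "b") and ("a", "b", NULL) would produce the same string.
    parts = [
        F.coalesce(F.col(c).cast("string"), F.lit("\u0000")) for c in cols
    ]
    return df.withColumn(key_name, F.sha2(F.concat_ws("\u0001", *parts), 256))


if __name__ == "__main__":
    for n in (1_000, 77_163, 1_000_000):
        print(n, collision_probability(n))
```

With a 32-bit hash, roughly 77,000 distinct rows are already enough for about a 50% chance of at least one collision, which would be consistent with the behaviour being rare and hard to reproduce on small samples. A call such as `add_sha2_key(df, ["col_a", "col_b"])` (column names assumed) yields a 64-character hex string per row, which costs more storage than an integer key.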

0 Answers