2

I have a table with N columns, I want to concatenate them all to a string column and then perform a hash on that column. I have found a similar question in Scala.

I want to do this entirely inside of Spark SQL ideally, I have tried HASH(*) as myhashcolumn but due to several columns being sometimes null I can't make this work as I would expected.

If I have to create a UDF and register it to make this happen, I need to use Python and not Scala as all my other code is in Python.

Any ideas?

sync11
  • 1,224
  • 2
  • 10
  • 23
Scott Bell
  • 161
  • 2
  • 11

3 Answers3

4

Try below code.

df.select([hash(col) for col in df.columns]).show()
Balaji Reddy
  • 5,576
  • 3
  • 36
  • 47
Neeraj Bhadani
  • 2,930
  • 16
  • 26
  • I think there are two issues with the above snippet. Firstly the above code will provide the hash of each col individually while the requirement is to create a single hash value considering all the cols. Secondly it is using variable name as 'col' which is actually a key word considering the pyspark.sql package. this will result into erroneous result. i think this should be a better approach df.select('*',hash(*df.columns)).show() same is mentioned by @d-to-the-k as wel – Kaushik Ghosh May 19 '23 at 11:38
3

You can do it in pyspark likes the following (just pass input columns to the function):

new_df = df.withColumn("contcatenated", hash_function(col("col1"), col("col2"), col("col3")))
OmG
  • 18,337
  • 10
  • 57
  • 90
  • Thanks, is there to map all columns dynamically? The reason, I'm not listening them inside of my sql is because this doesn't seem possible. – Scott Bell Nov 26 '18 at 11:58
2

If you want to generate a hash based on all the columns of a DataFrame dynamically, you can use this:

import pyspark.sql.functions as F

df.withColumn("checksum", F.xxhash64(*df.schema.names))

Explanation:
df.schema.names is a list with the names of all the columns in the DataFrame df. Using a * spreads this list into the elements it contains. You can then pass the elements to functions such as xxhash64 (for 64-bit hashes) and hash (for 32-bit hashes).

d-to-the-k
  • 36
  • 1