Is it possible to find the hash (preferably SHA-256) of a full PySpark DataFrame? I don't want the hash of individual rows or columns. I know PySpark has a function for column-level hash calculation: from pyspark.sql.functions import sha2.
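For example, this is the column-level hashing I already know how to do (column names are from my schema, the values are made up for illustration); it gives me a hash per value, not one hash for the DataFrame:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sha2

spark = SparkSession.builder.getOrCreate()

# Toy rows with my schema; the values are made up for illustration only.
df = spark.createDataFrame(
    [("apple", 10, "store1", "2023-01-05"), ("pear", 5, "store2", "2023-01-06")],
    ["Product", "Quantity", "Store", "SoldDate"],
)

# sha2 hashes one column, row by row -- a hash per value,
# not a single hash for the whole DataFrame.
df.select("Product", sha2(col("Product"), 256).alias("product_hash")).show(truncate=False)
```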
The requirement is to partition a big DataFrame by year and, for each year (a smaller DataFrame), compute the hash value and persist the result in a table.
Input: (Product, Quantity, Store, SoldDate)
Read the data into a DataFrame, partition by SoldDate, calculate the hash for each partition, and write it to a file/table.
Output: (Date, hash)
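The closest I have come up with is a workaround sketch like the one below: hash each row, then sort and re-hash the row hashes per date. The path /data/sales and the table name date_hashes are placeholders, and I am not sure this is the right approach, which is why I am asking whether there is a proper way to hash a whole DataFrame (or partition of one).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder source path; the real data has Product, Quantity, Store, SoldDate.
df = spark.read.parquet("/data/sales")

# Step 1: a deterministic hash per row (everything cast to string first).
row_hashed = df.withColumn(
    "row_hash",
    F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in df.columns]), 256),
)

# Step 2: combine all row hashes of a SoldDate into one value. Sorting the
# collected hashes first makes the result independent of row order, but
# collect_list pulls every row hash for a date into one array, which worries
# me for dates with many rows.
per_date = (
    row_hashed.groupBy("SoldDate")
    .agg(F.sha2(F.concat_ws("", F.sort_array(F.collect_list("row_hash"))), 256).alias("hash"))
    .withColumnRenamed("SoldDate", "Date")
)

# Persist the (Date, hash) result; date_hashes is a placeholder table name.
per_date.write.mode("overwrite").saveAsTable("date_hashes")
```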
The reason I am doing this is that I have to run this process daily and then check whether the hash has changed for any previous dates.
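To make the daily check concrete, the comparison I have in mind is something like this (both table names are placeholders for today's output and the previously persisted hashes):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder table names: today's (Date, hash) output vs. an earlier run.
new_hashes = spark.table("date_hashes_today")
old_hashes = spark.table("date_hashes_previous")

# Dates whose hash no longer matches the earlier run.
changed = (
    new_hashes.alias("new")
    .join(old_hashes.alias("old"), "Date")
    .where(F.col("new.hash") != F.col("old.hash"))
    .select("Date", F.col("old.hash").alias("old_hash"), F.col("new.hash").alias("new_hash"))
)
changed.show()
```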
File-level MD5 is possible, but I don't want to generate files; I want to calculate the hash on the fly for the partitions/small DataFrames based on dates.