Is it possible to find the hash (preferably SHA-256) of a full PySpark DataFrame? I don't want the hash of individual rows or columns. I know PySpark has a function for column-level hash calculation: from pyspark.sql.functions import sha2.
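For example, this is the column-level hashing I already know how to do (column names are from my schema, the values are made up for illustration); it gives me a hash per value, not one hash for the DataFrame:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sha2

spark = SparkSession.builder.getOrCreate()

# Toy rows with my schema; the values are made up for illustration only.
df = spark.createDataFrame(
    [("apple", 10, "store1", "2023-01-05"), ("pear", 5, "store2", "2023-01-06")],
    ["Product", "Quantity", "Store", "SoldDate"],
)

# sha2 hashes one column, row by row -- a hash per value,
# not a single hash for the whole DataFrame.
df.select("Product", sha2(col("Product"), 256).alias("product_hash")).show(truncate=False)
```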
The requirement is to partition a big DataFrame by year and, for each year (a smaller DataFrame), compute the hash value and persist the result in a table.
Input: (Product, Quantity, Store, SoldDate)
Read the data into a DataFrame, partition by SoldDate, calculate the hash for each partition, and write it to a file/table.
Output: (Date, hash)
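The closest I have come up with is a workaround sketch like the one below: hash each row, then sort and re-hash the row hashes per date. The path /data/sales and the table name date_hashes are placeholders, and I am not sure this is the right approach, which is why I am asking whether there is a proper way to hash a whole DataFrame (or partition of one).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder source path; the real data has Product, Quantity, Store, SoldDate.
df = spark.read.parquet("/data/sales")

# Step 1: a deterministic hash per row (everything cast to string first).
row_hashed = df.withColumn(
    "row_hash",
    F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in df.columns]), 256),
)

# Step 2: combine all row hashes of a SoldDate into one value. Sorting the
# collected hashes first makes the result independent of row order, but
# collect_list pulls every row hash for a date into one array, which worries
# me for dates with many rows.
per_date = (
    row_hashed.groupBy("SoldDate")
    .agg(F.sha2(F.concat_ws("", F.sort_array(F.collect_list("row_hash"))), 256).alias("hash"))
    .withColumnRenamed("SoldDate", "Date")
)

# Persist the (Date, hash) result; date_hashes is a placeholder table name.
per_date.write.mode("overwrite").saveAsTable("date_hashes")
```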
The reason I am doing this is that I have to run this process daily and then check whether the hash has changed for any previous dates.
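To make the daily check concrete, the comparison I have in mind is something like this (both table names are placeholders for today's output and the previously persisted hashes):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder table names: today's (Date, hash) output vs. an earlier run.
new_hashes = spark.table("date_hashes_today")
old_hashes = spark.table("date_hashes_previous")

# Dates whose hash no longer matches the earlier run.
changed = (
    new_hashes.alias("new")
    .join(old_hashes.alias("old"), "Date")
    .where(F.col("new.hash") != F.col("old.hash"))
    .select("Date", F.col("old.hash").alias("old_hash"), F.col("new.hash").alias("new_hash"))
)
changed.show()
```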
File-level MD5 is possible, but I don't want to generate files; I want to calculate the hash on the fly for the partitions/small DataFrames based on dates.