How to get the hash for a whole dataframe?

Question

I'm having difficulty finding anything covering this.

This answer is similar - Get the same hash value for a Pandas DataFrame each time. I'm looking for the same logic of returning a sha256 repeatably when passing in a dataframe, but using databricks / spark dataframes rather than Pandas.

Thanks,

What is your use case? What do you really want to achieve? This is not available out of the box as far as I know. You can hash each record, collect these hashes as an array to the driver if small enough and hash that again. — Georg Heiler, Sep 30 '20 at 21:42
The dataframes can be very large i.e: 100 TB, you can not insert such a key into the hash function. If you need to compare dataframes you can set a unique identifier for each record, let's call it row_id (this could also be the hash of each row). Next, you can compare the dataframes by comparing the two sets of the datasets and add more criteria such as row count — abiratsis, Oct 01 '20 at 15:01
Unless the data is small enough to be collected on the driver, some kind of [reduce](https://spark.apache.org/docs/3.0.0/api/scala/org/apache/spark/sql/Dataset.html#reduce(func:(T,T)=%3ET):T) function will be involved to enable parallelism. This function must be must be commutative and associative. Is it a hard requirement to use sha256? Or could it be any function? — werner, Oct 04 '20 at 16:05
+1 for the question, ideally to have analogy of Snowflake's HASH_AGG(*) function, for quick and universal comparison of big tables. All this "approaches" sounds like very slow hacks — VB_, Nov 08 '22 at 13:51

Sergio Polimante · Answer 1 · 2023-08-31T21:48:19.977

If you want to compare if the two dataframes are equal. I found this solution to be work with reasonable computational cost.

I am using Databricks SQL.

I compute a hash function for each row and then a FULL OUTER EXCLUSIVE JOIN. If the resulting dataframe is empty, then the tables are equal.

WITH t1 AS (
   SELECT
   hash(*) AS hash_column
   FROM table_1
),
t2 AS (
SELECT
hash(*) AS hash_column
FROM table_2
)

SELECT *
FROM t1
FULL OUTER JOIN t2 ON
    t1.hash_column = t2.hash_column
WHERE
t1.hash_column IS NULL
OR 
t2.hash_column IS NULL

Just take care, the hash function returns an Integer (32 bits). If your table is too big (about 10^6 unique values) you will increase the chances of collision (when different values produce the same hash).

I try experimenting with the crc32, which returns a BIGINT (64 bits), but the crc32(null) = null, so it cannot be used in the FULL OUTER JOIN

How to get the hash for a whole dataframe?

1 Answers1