If you want to compare if the two dataframes are equal. I found this solution to be work with reasonable computational cost.
I am using Databricks SQL.
I compute a hash function for each row and then a FULL OUTER EXCLUSIVE JOIN
. If the resulting dataframe is empty, then the tables are equal.
WITH t1 AS (
SELECT
hash(*) AS hash_column
FROM table_1
),
t2 AS (
SELECT
hash(*) AS hash_column
FROM table_2
)
SELECT *
FROM t1
FULL OUTER JOIN t2 ON
t1.hash_column = t2.hash_column
WHERE
t1.hash_column IS NULL
OR
t2.hash_column IS NULL
Just take care, the hash function returns an Integer (32 bits). If your table is too big (about 10^6 unique values) you will increase the chances of collision (when different values produce the same hash).
I try experimenting with the crc32, which returns a BIGINT (64 bits), but the crc32(null) = null, so it cannot be used in the FULL OUTER JOIN