The best route depends on your use case. Typically this could be done with MD5 file hashes or with row-wise checksums.
MD5 file-based hash with Spark
You could hash each file with MD5 and store the result somewhere for comparison. You then read a single file at a time and compare it to the stored results to identify duplicates.
MD5 isn't a splittable algorithm, though. It must run sequentially through an entire file. This makes it somewhat less useful with Spark, because when hashing a large file you can't benefit from Spark's data distribution capabilities.
Instead, if you must use Spark to hash whole files, you could use the wholeTextFiles method together with a map function that calculates the MD5 hash.
This reads the entire contents of one or more files into a single record in a partition. If you have several executors but only one file, then all executors except one will sit idle. A Spark record cannot be split across executors, so you also risk running out of memory if the largest file's contents don't fit in executor memory.
Anyway, here is what it looks like:
import hashlib

# Read each file under `location` as a single (file_path, file_contents) record
rdd = spark.sparkContext.wholeTextFiles(location)

def map_hash_file(row):
    file_name = row[0]
    file_contents = row[1]
    md5_hash = hashlib.md5()
    md5_hash.update(file_contents.encode('utf-8'))
    return file_name, md5_hash.hexdigest()

# One (file_name, md5_hexdigest) pair per file
rdd.map(map_hash_file).collect()
A benefit of this approach is that if you have many files in a folder, you can compute the MD5 for each of them in parallel. You must just ensure that the largest possible file fits into a single record, i.e. into your executor memory.
With this approach you'd read all the files in the folder each time and compute the hashes in parallel, but you don't need to store and retrieve the hashes somewhere, as you would if you were processing one file at a time.
If you only want to detect duplicates in a folder, and don't mind duplicates across folders, then perhaps this approach would work.
If you also want to detect duplicates across folders, you'd need to read in all of those files too, or just store their hashes somewhere if you already processed them.
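For example, here is a minimal sketch (reusing the rdd and map_hash_file defined above) that groups the collected (file_name, hash) pairs to surface files with identical contents:

from collections import defaultdict

# Collect the (file_name, md5_hexdigest) pairs computed by map_hash_file above
hashes = rdd.map(map_hash_file).collect()

# Group file names by hash; any hash with more than one file is a duplicate set
files_by_hash = defaultdict(list)
for file_name, md5_hex in hashes:
    files_by_hash[md5_hex].append(file_name)

duplicates = {h: files for h, files in files_by_hash.items() if len(files) > 1}
print(duplicates)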
MD5 file-based hash without Spark
If you want to store the hashes and process a single file at a time, and therefore want to avoid reading in all the files, or if you can't fit the largest file in memory, then you'd need an approach other than Spark.
Since you are using PySpark, how about using regular Python to read and hash the file? You don't need to read the entire file into memory; you can read it in small chunks and feed them to MD5 serially.
From https://stackoverflow.com/a/1131238/2409299 :
import hashlib

with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    # Read in 8 KB chunks so the whole file never needs to be in memory at once
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes
Then compare the hexdigest with previously stored hashes to identify a duplicate file.
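As a hedged sketch of that comparison step, assuming you persist the hashes in a simple JSON file named stored_hashes.json (the file name and layout are purely illustrative):

import hashlib
import json

def md5_of_file(path, chunk_size=8192):
    # Chunked MD5 as above, so large files never need to fit in memory
    file_hash = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            file_hash.update(chunk)
    return file_hash.hexdigest()

# Illustrative store: {"some/file.csv": "9e107d9d372bb6826bd81d3542a419d6", ...}
with open("stored_hashes.json") as f:
    stored_hashes = json.load(f)

new_hash = md5_of_file("your_filename.txt")
matches = [path for path, digest in stored_hashes.items() if digest == new_hash]
if matches:
    print("Duplicate of:", matches)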
Row-wise checksum
Using the Spark CSV reader means you've already unpacked the file into rows and columns, so you can no longer compute an accurate file hash. You could instead compute row-wise hashes by adding a hash column per row, sorting the dataset the same way across all the files, and hashing down that column to get a deterministic result. It would probably be easier to go the file-based MD5 route, though.
With Spark SQL and the built-in hash function:
spark.sql("SELECT *, HASH(*) AS row_hash FROM my_table")
Spark's HASH function is not an MD5 algorithm, and in my opinion it may not be suitable for this use case. For example, it skips columns that are NULL, which can cause hash collisions (false-positive duplicates).
The hash values below are the same:
spark.sql("SELECT HASH(NULL, 'a', 'b'), HASH('a', NULL , 'b')")
+----------------+----------------+
|hash(NULL, a, b)|hash(a, NULL, b)|
+----------------+----------------+
| 190734147| 190734147|
+----------------+----------------+
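If you do go the row-wise route, one hedged alternative sketch is to build the per-row hash yourself with md5 over all columns, using an explicit NULL marker and a separator so shifted NULLs no longer collide. The "<NULL>" sentinel and "||" separator are arbitrary choices, and location is assumed to be your CSV folder:

import hashlib
from pyspark.sql import functions as F

df = spark.read.csv(location, header=True)

# Per-row MD5 over every column, with NULLs replaced by an explicit marker so
# ('a', NULL, 'b') and (NULL, 'a', 'b') produce different hashes
cols = [F.coalesce(F.col(c).cast("string"), F.lit("<NULL>")) for c in df.columns]
df_with_hash = df.withColumn("row_hash", F.md5(F.concat_ws("||", *cols)))

# One deterministic fingerprint per file: sort the row hashes and MD5 their
# concatenation (collect() is only reasonable here if the file is modest in size)
row_hashes = [r["row_hash"] for r in df_with_hash.select("row_hash").orderBy("row_hash").collect()]
file_fingerprint = hashlib.md5("".join(row_hashes).encode("utf-8")).hexdigest()
print(file_fingerprint)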
Other notes
Some data stores, such as S3, expose object metadata (the ETag) that acts like a hash. If you are using such a data store, you could simply retrieve and compare these to identify duplicates and avoid hashing any files yourself.
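For example, with S3 and boto3 this could look roughly like the sketch below. The bucket and prefix are placeholders, and note that the ETag of a multipart upload is not a plain MD5 of the object contents, so it only identifies duplicates reliably among objects uploaded the same way:

import boto3
from collections import defaultdict

s3 = boto3.client("s3")

# Group object keys by ETag; an ETag shared by several keys suggests duplicate content
objects_by_etag = defaultdict(list)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="my/folder/"):
    for obj in page.get("Contents", []):
        objects_by_etag[obj["ETag"]].append(obj["Key"])

duplicates = {etag: keys for etag, keys in objects_by_etag.items() if len(keys) > 1}
print(duplicates)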