The best route depends on your use case. Typically this could be done with MD5 file hashes or with row-wise checksums.
MD5 file-based hash with Spark
You could hash each file with MD5 and store the result somewhere for comparison. You then read a single file at a time and compare it to the stored results to identify duplicates.
MD5 isn't a splittable algorithm, though. It must run sequentially through an entire file. This makes it somewhat less useful with Spark, because when hashing a large file you can't benefit from Spark's data distribution capabilities.
Instead, if you must use Spark to hash whole files, you could use the wholeTextFiles method together with a map function that calculates the MD5 hash.
This reads the entire contents of one or more files into a single record in a partition. If you have several executors but only one file, then all executors except one will sit idle. A Spark record cannot be split across executors, so you also risk running out of memory if the largest file's contents don't fit in executor memory.
Anyway, here is what it looks like:
import hashlib

# Read each file under `location` as a single (file_path, file_contents) record
rdd = spark.sparkContext.wholeTextFiles(location)

def map_hash_file(row):
    file_name = row[0]
    file_contents = row[1]
    md5_hash = hashlib.md5()
    md5_hash.update(file_contents.encode('utf-8'))
    return file_name, md5_hash.hexdigest()

# One (file_name, md5_hexdigest) pair per file
rdd.map(map_hash_file).collect()
A benefit of this approach is that if you have many files in a folder, you can compute the MD5 for each of them in parallel. You must just ensure that the largest possible file fits into a single record, i.e. into your executor memory.
With this approach you'd read all the files in the folder each time and compute the hashes in parallel, but you don't need to store and retrieve the hashes somewhere, as you would if you were processing one file at a time.
If you only want to detect duplicates in a folder, and don't mind duplicates across folders, then perhaps this approach would work.
If you also want to detect duplicates across folders, you'd need to read in all of those files too, or just store their hashes somewhere if you already processed them.
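For example, here is a minimal sketch (reusing the rdd and map_hash_file defined above) that groups the collected (file_name, hash) pairs to surface files with identical contents:

from collections import defaultdict

# Collect the (file_name, md5_hexdigest) pairs computed by map_hash_file above
hashes = rdd.map(map_hash_file).collect()

# Group file names by hash; any hash with more than one file is a duplicate set
files_by_hash = defaultdict(list)
for file_name, md5_hex in hashes:
    files_by_hash[md5_hex].append(file_name)

duplicates = {h: files for h, files in files_by_hash.items() if len(files) > 1}
print(duplicates)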
MD5 file-based hash without Spark
If you want to store the hashes and process a single file at a time, and therefore want to avoid reading in all the files, or if you can't fit the largest file in memory, then you'd need an approach other than Spark.
Since you are using PySpark, how about using regular Python to read and hash the file? You don't need to read the entire file into memory; you can read it in small chunks and feed them to MD5 serially.
From https://stackoverflow.com/a/1131238/2409299 :
import hashlib

with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    # Read in 8 KB chunks so the whole file never needs to be in memory at once
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes
Then compare the hexdigest with previously stored hashes to identify a duplicate file.
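As a hedged sketch of that comparison step, assuming you persist the hashes in a simple JSON file named stored_hashes.json (the file name and layout are purely illustrative):

import hashlib
import json

def md5_of_file(path, chunk_size=8192):
    # Chunked MD5 as above, so large files never need to fit in memory
    file_hash = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            file_hash.update(chunk)
    return file_hash.hexdigest()

# Illustrative store: {"some/file.csv": "9e107d9d372bb6826bd81d3542a419d6", ...}
with open("stored_hashes.json") as f:
    stored_hashes = json.load(f)

new_hash = md5_of_file("your_filename.txt")
matches = [path for path, digest in stored_hashes.items() if digest == new_hash]
if matches:
    print("Duplicate of:", matches)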
Row-wise checksum
Using the Spark CSV reader means you've already unpacked the file into rows and columns, so you can no longer compute an accurate file hash. You could instead compute row-wise hashes by adding a hash column per row, sorting the dataset the same way across all the files, and hashing down that column to get a deterministic result. It would probably be easier to go the file-based MD5 route, though.
With Spark SQL and the built-in hash function:
spark.sql("SELECT *, HASH(*) AS row_hash FROM my_table")
Spark's HASH function is not an MD5 algorithm, and in my opinion it may not be suitable for this use case. For example, it skips columns that are NULL, which can cause hash collisions (false-positive duplicates).
The hash values below are the same:
spark.sql("SELECT HASH(NULL, 'a', 'b'), HASH('a', NULL , 'b')")
+----------------+----------------+
|hash(NULL, a, b)|hash(a, NULL, b)|
+----------------+----------------+
| 190734147| 190734147|
+----------------+----------------+
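If you do go the row-wise route, one hedged alternative sketch is to build the per-row hash yourself with md5 over all columns, using an explicit NULL marker and a separator so shifted NULLs no longer collide. The "<NULL>" sentinel and "||" separator are arbitrary choices, and location is assumed to be your CSV folder:

import hashlib
from pyspark.sql import functions as F

df = spark.read.csv(location, header=True)

# Per-row MD5 over every column, with NULLs replaced by an explicit marker so
# ('a', NULL, 'b') and (NULL, 'a', 'b') produce different hashes
cols = [F.coalesce(F.col(c).cast("string"), F.lit("<NULL>")) for c in df.columns]
df_with_hash = df.withColumn("row_hash", F.md5(F.concat_ws("||", *cols)))

# One deterministic fingerprint per file: sort the row hashes and MD5 their
# concatenation (collect() is only reasonable here if the file is modest in size)
row_hashes = [r["row_hash"] for r in df_with_hash.select("row_hash").orderBy("row_hash").collect()]
file_fingerprint = hashlib.md5("".join(row_hashes).encode("utf-8")).hexdigest()
print(file_fingerprint)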
Other notes
Some data stores, such as S3, expose object metadata (the ETag) that acts like a hash. If you are using such a data store, you could simply retrieve and compare these to identify duplicates and avoid hashing any files yourself.
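For example, with S3 and boto3 this could look roughly like the sketch below. The bucket and prefix are placeholders, and note that the ETag of a multipart upload is not a plain MD5 of the object contents, so it only identifies duplicates reliably among objects uploaded the same way:

import boto3
from collections import defaultdict

s3 = boto3.client("s3")

# Group object keys by ETag; an ETag shared by several keys suggests duplicate content
objects_by_etag = defaultdict(list)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="my/folder/"):
    for obj in page.get("Contents", []):
        objects_by_etag[obj["ETag"]].append(obj["Key"])

duplicates = {etag: keys for etag, keys in objects_by_etag.items() if len(keys) > 1}
print(duplicates)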