Following "How do I calculate the MD5 checksum of a file in Python?", I wrote a script that uses MD5 to remove duplicate files in the folder dst_dir. However, for many files (.jpg and .mp4), the MD5 comparison failed to remove the duplicates. I tried the methods mentioned in "Python 3 same text but different md5 hashes", and they did not work. I suspect it might be the file properties (the "modification date" etc.) attached to the image files that changed.

import hashlib
import os

dst_dir = "/"
directory = dst_dir

# Maps each MD5 digest to the path of the file currently kept for it
md5_to_path = {}

for root, subdirectories, files in os.walk(directory):
    # Skip Tresorit's internal folders
    if ".tresorit" in root:
        continue

    for file in files:
        file_path = os.path.abspath(os.path.join(root, file))
        print(file_path)

        # Hash the contents in chunks so large .mp4 files are not read into memory at once
        md5 = hashlib.md5()
        with open(file_path, 'rb') as file_to_check:
            for chunk in iter(lambda: file_to_check.read(1024 * 1024), b""):
                md5.update(chunk)
        md5_returned = md5.hexdigest()

        if md5_returned not in md5_to_path:
            # First file seen with this digest: keep it
            md5_to_path[md5_returned] = file_path
            continue

        # Duplicate found: prefer removing the copy without "-" in its name
        print(["Duplicate file", file_path, md5_returned])
        kept_path = md5_to_path[md5_returned]
        if "-" not in file:
            os.remove(file_path)
            print("Duplicate file removed 01")
        elif "-" not in kept_path:
            # Remove the previously kept file and keep the current one instead
            os.remove(kept_path)
            md5_to_path[md5_returned] = file_path
            print("Duplicate file removed 02")
        else:
            os.remove(file_path)
            print("Duplicate file removed 03")

How can I fix the Python MD5 calculation so that identical image files produce the same MD5 value?
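One way to test the suspicion about file properties is to hash a file, change only its modification time, and hash it again. The sketch below is a minimal check under stated assumptions: `md5_of_file` and `mtime_changes_hash` are hypothetical helper names, and the payload is placeholder bytes rather than a real image. Filesystem timestamps are stored outside the file contents, so they should not influence a hash of the bytes. Metadata embedded in the file itself (such as EXIF) is part of the contents and would change the hash.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mtime_changes_hash(payload=b"example image bytes"):
    """Hypothetical check: does changing the modification time alter the MD5?"""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.bin")
        with open(path, "wb") as handle:
            handle.write(payload)
        before = md5_of_file(path)
        # Set access and modification times to an arbitrary earlier moment
        os.utime(path, (1_000_000_000, 1_000_000_000))
        after = md5_of_file(path)
    return before != after


print(mtime_changes_hash())  # prints False
```

If this prints False on the affected machine as well, the differing hashes come from the file bytes themselves and not from the "modification date" property.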

  • If I draw a perfectly white image 640x480 pixels and save it as a PNG, is it the same image as if I save it as a GIF? Do you think they'll have the same md5 hash? What if I save it as a TIFF, with 16 bits/pixel? Or 32-bits? What if I add a copyright as EXIF, is it still the same image? Do you think the md5 will be the same? – Mark Setchell May 08 '23 at 22:34
  • This could use some clarification. I infer that while the script analyzes images of multiple formats, only exact duplicates (literal copies of the same source file) should be detected. Is this accurate, ShoutOutAndCalculate? Or is Mark correct that you're trying to detect the same image when present in different file formats? – CrazyChucky May 08 '23 at 22:36
  • See https://stackoverflow.com/a/28834788/2836621 and https://stackoverflow.com/a/54053080/2836621 – Mark Setchell May 08 '23 at 22:39
  • @CrazyChucky It's not just file formats, it's also different bit-depths, different meta-data, different compression, different encoding... – Mark Setchell May 08 '23 at 22:42
  • @MarkSetchell I just want the code to analyze and delete what are supposed to be exact copies, i.e. "file 1.jpg" and "file 1 copy.jpg", or "file 2.mp4" and "file 2 copy.mp4". However, the md5 values for what were supposed to be identical files somehow turned out to be different. – ShoutOutAndCalculate May 09 '23 at 02:55
  • @CrazyChucky Yes, they were supposed to be exact copies. But one file was uploaded to the server and the other was transferred through USB. They had the same file size etc. – ShoutOutAndCalculate May 09 '23 at 03:33
  • @MarkSetchell I read the links and did some googling. It seems that the metadata or the EXIF data needs to be excluded from the ".mp4" or ".jpg" file. Is there a way to do that consistently, for any file format? – ShoutOutAndCalculate May 09 '23 at 03:55
  • If you are looking for files that are absolutely, byte-for-byte identical, then an md5 hash is the way to go. Try using a command-line tool (outside of Python) to hand-check the files you are having issues with. – Mark Setchell May 09 '23 at 06:58
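The hand-check suggested in the comments can also be sketched in Python, as a complement to a command-line tool. The example below is a minimal sketch, not a definitive implementation: `first_difference` is a hypothetical helper, and the two temporary files are made-up stand-ins for "file 1.jpg" and "file 1 copy.jpg" that have the same size but different bytes. `filecmp.cmp(..., shallow=False)` from the standard library compares the actual contents, and the helper reports where the first mismatch occurs.

```python
import filecmp
import os
import tempfile


def first_difference(path_a, path_b, chunk_size=1024 * 1024):
    """Return the offset of the first differing byte, or None if identical."""
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                # Locate the exact position inside the mismatching chunks
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        return offset + i
                # No differing byte in the overlap: one file ended earlier
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                # Both files reached end-of-file without a mismatch
                return None
            offset += len(chunk_a)


def demo():
    """Build two same-sized placeholder files and compare them."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "file 1.jpg")
        copy = os.path.join(tmp, "file 1 copy.jpg")
        with open(original, "wb") as handle:
            handle.write(b"\xff\xd8 header AAAA payload")
        with open(copy, "wb") as handle:
            handle.write(b"\xff\xd8 header BBBB payload")
        identical = filecmp.cmp(original, copy, shallow=False)
        return identical, first_difference(original, copy)


print(demo())
```

A mismatch offset near the start of a .jpg or .mp4 would hint at differing header or metadata bytes, while `None` means the two files are byte-for-byte identical and should produce the same md5.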
