Following "How do I calculate the MD5 checksum of a file in Python?", I wrote a script that uses MD5 to remove duplicate files in the folder dst_dir. However, for many files (.jpg and .mp4), the script did not remove the duplicates. I tried the methods mentioned in "Python 3 same text but different md5 hashes", but they did not work. I suspect the cause might be the file properties (the "modification date" etc.) attached to the image files, which may have changed.
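A minimal sketch for testing that suspicion in isolation. The helper names `md5_of_file` and `digest_before_and_after_touch` and the throwaway file are made up for illustration. It hashes a file, changes only the modification time with `os.utime`, and hashes again. Since `hashlib.md5` only sees the bytes read from the file, a filesystem timestamp should not affect the digest; metadata embedded inside the file itself (EXIF tags, for example) is part of those bytes and would.

```python
import hashlib
import os
import tempfile


def md5_of_file(path):
    """Return the hex MD5 digest of the file's contents only."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def digest_before_and_after_touch(path):
    """Hash a file, change only its timestamps, and hash it again."""
    before = md5_of_file(path)
    # Set access and modification time to an arbitrary fixed moment
    os.utime(path, (1_000_000_000, 1_000_000_000))
    after = md5_of_file(path)
    return before, after


# Demonstration on a throwaway file rather than a real image
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as f:
        f.write(b"same bytes")
    before, after = digest_before_and_after_touch(sample)
    print(before == after)
```

If the two digests match here but two seemingly identical images still hash differently, the files presumably differ in their actual bytes (re-encoding or embedded metadata), not in their filesystem properties.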
import hashlib
import os

dst_dir = "/"
directory = dst_dir

# Parallel lists: md5_list[i] is the MD5 of md5_file_list[i]
md5_list = []
md5_file_list = []

for root, subdirectories, files in os.walk(directory):
    if ".tresorit" not in root:
        for file in files:
            file_path = os.path.abspath(os.path.join(root, file))
            print(file_path)
            # Open the file, read it, and calculate the MD5 of its contents
            with open(file_path, 'rb') as file_to_check:
                # Read the contents of the file
                data = file_to_check.read()
                # Pipe the contents of the file through MD5
                md5_returned = hashlib.md5(data).hexdigest()
            if md5_returned not in md5_list:
                md5_list.append(md5_returned)
                md5_file_list.append(file_path)
            else:
                # Remove the duplicate file
                print(["Duplicate file", file_path, md5_returned])
                if "-" not in file:
                    os.remove(file_path)
                    print("Duplicate file removed 01")
                else:
                    file_list_index = md5_list.index(md5_returned)
                    if "-" not in md5_file_list[file_list_index]:
                        # Remove the earlier file and keep the current one
                        os.remove(md5_file_list[file_list_index])
                        del md5_list[file_list_index]
                        del md5_file_list[file_list_index]
                        print("Duplicate file removed 02")
                        md5_list.append(md5_returned)
                        md5_file_list.append(file_path)
                    else:
                        os.remove(file_path)
                        print("Duplicate file removed 03")
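For diagnosing which files the script considers identical, a non-destructive variant may help. The following is only a sketch under assumptions: the function names `md5_of_file`, `group_by_md5`, and `report_duplicates` and the 1 MiB chunk size are illustrative choices, not part of the script above. It reads each file in chunks (so large .mp4 files are not loaded into memory at once), groups paths by digest in a dictionary, and deletes nothing.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file's contents chunk by chunk and return the hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_md5(directory, skip=".tresorit"):
    """Map each MD5 digest to the list of files under `directory` that have it."""
    groups = defaultdict(list)
    for root, _subdirs, files in os.walk(directory):
        if skip in root:
            continue
        for name in files:
            path = os.path.abspath(os.path.join(root, name))
            groups[md5_of_file(path)].append(path)
    return groups


def report_duplicates(directory):
    """Return only the digest groups with more than one file; nothing is deleted."""
    return {
        digest: paths
        for digest, paths in group_by_md5(directory).items()
        if len(paths) > 1
    }
```

Calling `report_duplicates(dst_dir)` and printing the result shows which files share a digest. If two images that look the same end up in different groups, comparing their sizes with `os.path.getsize` is a quick way to confirm that their bytes really differ.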
How can I fix the MD5 calculation in Python so that identical image files produce the same MD5 value?