
Working on a small script that allows the user to select a folder to search for duplicate files, whether those are images, text files, etc. It should then move those duplicate files into another folder of the user's choice.

This is the code I have so far:

import hashlib
import os
import shutil

from tkinter import Tk
from tkinter.filedialog import askdirectory

Tk().withdraw()

source = askdirectory(title="Select the source folder")

walker = os.walk(source)
uniqueFiles = dict()
total = 0

for folder, sub_folder, files in walker:
    for file in files:
        filepath = os.path.join(folder, file)
        filehash = hashlib.md5((open(filepath, "rb").read())).hexdigest()

        if filehash in uniqueFiles:
            print(f"{filepath} is a duplicate")
            total += 1
        else:
            # remember the first file seen with this hash
            uniqueFiles[filehash] = filepath

print(f"\n# of duplicate files found: {total}")

    # destination = askdirectory(title="Select the target folder")
    # shutil.move(filepath, destination, copy_function=shutil.copytree)

It works fine so far, finding all the duplicate files in a folder and its subfolders and printing them out. The part I'm stuck on is how to move them. The commented code at the bottom seems to work, but it prompts the user for a folder for every duplicate found. I just want it to list all the duplicates and then move them at once.

Any ideas on how I could restructure my code?

Thanks!

Kingspud
  • You probably just want to move the target directory prompt out of the loop. Either before or after is fine, depending on how you want the UI to flow. – joshmeranda Apr 04 '22 at 13:52
  • @joshmeranda Yeah, but then wouldn't the shutil.move() command also be out of the loop, since it won't have access to the destination variable? – Kingspud Apr 04 '22 at 14:09
  • Oh, you're right. Then just prompt before the loop and store the destination, or if you prefer to prompt after, you can store the duplicate files in a dictionary or list of tuples. – joshmeranda Apr 04 '22 at 14:25
  • @joshmeranda Hmm, I'm a little confused by that, sorry. Could you elaborate a little more? – Kingspud Apr 04 '22 at 14:37
  • See my answer below – joshmeranda Apr 04 '22 at 16:15

1 Answer


So you have two options here (as described by the comments on your question):

  1. Prompt for the target directory beforehand
  2. Prompt for the target directory afterward

The first option is probably the simplest and most efficient, and requires the least refactoring. It does, however, require the user to input a target directory whether or not there are any duplicate files, or even if an error occurs during the search, so it might be worse from a user's perspective:

# prompt for the target directory before scanning
destination = askdirectory(title="Select the target folder")

uniqueFiles = dict()

for folder, sub_folder, files in walker:
    for file in files:
        filepath = os.path.join(folder, file)
        filehash = hashlib.md5(open(filepath, "rb").read()).hexdigest()

        if filehash in uniqueFiles:
            shutil.move(filepath, destination, copy_function=shutil.copytree)
        else:
            # remember the first file seen with this hash
            uniqueFiles[filehash] = filepath
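One caveat worth noting: shutil.move will raise an error if two duplicates share the same base name, since both would land at the same path inside the destination folder. A minimal sketch of a collision-avoiding move (the unique_destination helper is illustrative, not part of the code above):

```python
import os
import shutil


def unique_destination(destination, filename):
    """Return a path inside destination that does not clash with an existing file."""
    base, ext = os.path.splitext(filename)
    candidate = os.path.join(destination, filename)
    counter = 1
    while os.path.exists(candidate):
        # append a numeric suffix until the name is free, e.g. photo_1.jpg
        candidate = os.path.join(destination, f"{base}_{counter}{ext}")
        counter += 1
    return candidate
```

You would then call shutil.move(filepath, unique_destination(destination, os.path.basename(filepath))) inside the loop instead of moving straight into the folder.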

The second option allows you to perform all the necessary checks and error handling before prompting, but it is more complex and requires more refactoring:

# dictionary of hashes to all files
hashes = {}

for folder, sub_folder, files in walker:
    for file in files:
        filepath = os.path.join(folder, file)
        filehash = hashlib.md5(open(filepath, "rb").read()).hexdigest()

        if filehash in hashes:
            hashes[filehash].append(filepath)
        else:
            hashes[filehash] = [filepath]

# prompt for the target directory after scanning
destination = askdirectory(title="Select the target folder")

for duplicates in hashes.values():
    if len(duplicates) < 2:
        continue

    # keep the first copy, move the rest
    for duplicate in duplicates[1:]:
        shutil.move(duplicate, destination, copy_function=shutil.copytree)

As a side note, you will want to close the files you are hashing, especially if checking a large file tree; a with block handles this for you:

with open(filepath, "rb") as file:
    filehash = hashlib.md5(file.read()).hexdigest()
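Related to that, file.read() still loads each entire file into memory, which can be a problem for large files. A minimal sketch of hashing in fixed-size chunks instead (the hash_file helper and its chunk size are my own choices, not from the code above):

```python
import hashlib


def hash_file(filepath, chunk_size=65536):
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with open(filepath, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

This produces the same digest as hashing the whole file at once, so it can be dropped in as a replacement for the inline hashlib.md5(...) call.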
joshmeranda
    This can be optimised (significantly under certain circumstances) by hashing only files with the same size. See https://stackoverflow.com/questions/748675/finding-duplicate-files-and-removing-them – AcK Apr 07 '22 at 23:33
  • @AcK That's a VERY good point I probably should have mentioned in my answer, thanks for catching that! – joshmeranda Apr 08 '22 at 01:13
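The size-based optimisation mentioned in the comment above can be sketched roughly as follows: group files by size first, and only hash files whose size is shared, since files of different sizes cannot be identical (find_duplicates is an illustrative name, not from the answer):

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(source):
    """Return groups of duplicate file paths, hashing only same-sized files."""
    by_size = defaultdict(list)
    for folder, _, files in os.walk(source):
        for name in files:
            path = os.path.join(folder, name)
            by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        for path in paths:
            with open(path, "rb") as f:
                by_hash[hashlib.md5(f.read()).hexdigest()].append(path)

    return [paths for paths in by_hash.values() if len(paths) > 1]
```

In a tree where most files have unique sizes, this skips the expensive read-and-hash step for the vast majority of them.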