
Is there a simple and fast Python way to identify duplicate files in a directory tree based on file size and last write time only? (A couple of false positives are OK. Forget hashing: too slow, and not needed for the initial identification of potential real dups.)

S/O abounds with similar questions but they tend to utilize md5 or byte-by-byte comparison.

Any suggestions? Or do I need to run the code below and compare the output to find duplicate lines in the first two columns (and maybe run a hash only on the files with matching LWT and size)?

import os, time

def get_size(filename):
    # File size in bytes, returned as a string.
    st = os.stat(filename)
    return str(st.st_size)

def get_last_write_time(filename):
    # Last write time (mtime), formatted as a human-readable local timestamp.
    st = os.stat(filename)
    return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime))
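
Roughly the idea as a sketch: group every file under a root by its (size, mtime) pair and keep only the groups with more than one member; the helper name find_candidate_dups and the "." root are illustrative:

import os
from collections import defaultdict

def find_candidate_dups(root):
    # Group files by (size, mtime); groups with 2+ members are potential duplicates.
    groups = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip unreadable files
            groups[(st.st_size, st.st_mtime)].append(path)
    return {key: paths for key, paths in groups.items() if len(paths) > 1}

for (size, mtime), paths in find_candidate_dups(".").items():
    print(size, mtime, paths)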
  • Correct, so far that's my best bet. I am wondering if there is a built-in module in Python, or code which compares the size and LWT and **only** hashes files matching those values, or other solutions... – HakariDo Nov 09 '17 at 05:39
  • Why are you looking at the last write time? If I make a file with "eggs" in it, then ten minutes later another file with "eggs", they are still duplicates, even if the modification time differs by ten minutes. – Wander Nauta Nov 09 '17 at 07:26
  • You are correct, they are duplicates. I agree the size alone could be a quick way to check. But in real life it is rare to re-create the exact same files, and LWT adds another layer of filtering. Try it out on your work/personal files and see how unique the LWT and sizes really are. Very unique, in my experience, although I do have examples where different files share the same LWT and size. If you leave the .ext out of the quick check, it is even more stringent. – HakariDo Nov 09 '17 at 07:32
  • Hi again, try this one! ... it will check duplicates in a directory. – DRPK Nov 09 '17 at 14:06

1 Answer


LOL! That's my code! :)))))))

Try this (LAST UPDATE):

import os, hashlib

your_target_folder = "."  # change to your target folder


def size_check(get_path):
    # Return the file size in bytes as a string, or "Error" if the file is unreadable.
    try:
        st = os.stat(get_path)
    except OSError:
        return "Error"
    else:
        return str(st.st_size)


def md5_check(get_path):
    # Hash the file contents in 1 MiB chunks; return "Error" if the file is unreadable.
    try:
        hash_md5 = hashlib.md5()
        with open(get_path, "rb") as f:
            for chunk in iter(lambda: f.read(2 ** 20), b""):
                hash_md5.update(chunk)
    except OSError:
        return "Error"
    else:
        return hash_md5.hexdigest()


def save_data(get_output):
    # Append one result line to data.txt in the current directory.
    with open("./data.txt", "a") as output_data:
        output_data.write(get_output)


print("Walking all files in your target directory and grabbing their sizes, please wait...\n")

# Pass 1: group every file path by its size (cheap, no file contents are read).
files_and_sizes = {}
for dirpath, _, filenames in os.walk(your_target_folder):
    for item in filenames:
        file_full_path = os.path.abspath(os.path.join(dirpath, item))
        file_size = size_check(file_full_path)
        if file_size in files_and_sizes:
            files_and_sizes[file_size].append(file_full_path)
        else:
            files_and_sizes[file_size] = [file_full_path]

# Pass 2: hash only the files whose size occurs more than once.
new_dict = {}
error_box = []

for key, box_name in files_and_sizes.items():
    if key != "Error" and len(box_name) > 1:
        for file_path in box_name:
            file_hash = md5_check(file_path)
            if file_hash != "Error":
                if file_hash in new_dict:
                    new_dict[file_hash].append(file_path)
                else:
                    new_dict[file_hash] = [file_path]
            else:
                error_box.append(file_path)
    elif key == "Error":
        error_box.extend(box_name)

# Report: every path in a hash group of two or more is a duplicate.
for file_hash, names in new_dict.items():
    if len(names) > 1:
        for each_file in names:
            result = each_file + "\n"
            print(result)
            save_data(result)

if error_box:
    print("Something went wrong on these files (I could not access them): " + str(error_box) + "\n")

print("Goodbye.")

Good Luck...

  • Thanks! I am trying, but where in `main.py` am I inserting `dupi.py`? Is there anything else to change in `main.py` in order to see it working? – HakariDo Nov 09 '17 at 11:32
  • @HakariDo: did you read my explanation? You should copy+paste that class into dupi.py, then import dupi in any of your .py files with that method! Maybe tick me with a +1, please? :) – DRPK Nov 09 '17 at 11:36
  • I will definitely tick you, but let's make it work first. Now I have dupi.py and main.py in the same folder, and I run main.py as it is. Getting the error: `line 6 ModuleNotFoundError: No module named dupi`. – HakariDo Nov 09 '17 at 11:46
  • @HakariDo: check the case: is the file named Dupi or dupi? And do you import dupi or Dupi? They should be the same! And did you change sys.path.append('Dupi_folder_address like "." ') to sys.path.append(".")? – DRPK Nov 09 '17 at 11:52
  • dupi.py and main.py are in the same dir. main.py has `sys.path.append(".")` and `from dupi import *`, and I am still getting the error. – HakariDo Nov 09 '17 at 11:58
  • @HakariDo: this is not about Python; something is wrong in your paths or file names. Please add an update section to your question, put your code (not my original code!) in the update box, and take a screenshot of your folder. – DRPK Nov 09 '17 at 12:15
  • @HakariDo: I will check it. – DRPK Nov 09 '17 at 12:16
  • Why don't you edit the code in your answer exactly as it should be, so we can move ahead with troubleshooting? Let's use lowercase dupi and enter the dir name so both main.py and dupi.py are in the same folder. – HakariDo Nov 09 '17 at 12:20
  • @HakariDo: "ModuleNotFoundError: No module named dupi" means no such file or directory... sys.path.append works, but `import` could not find dupi.py; it's about your paths! But OK, I will update my answer... now... just a minute, please. – DRPK Nov 09 '17 at 12:27
  • @HakariDo: check my update. – DRPK Nov 09 '17 at 12:34
  • Now it seems the two files are communicating, but I am getting errors on line 19 in main.py and lines 15 and 29 in dupi.py. – HakariDo Nov 09 '17 at 12:41
  • You have some special errors in your environment paths, and I do not know why! OK, do not use it as a module! Just use it as a class... I will delete my other code. Check it again. – DRPK Nov 09 '17 at 12:44
  • Errors on lines 15, 29, and 47 in the updated file. – HakariDo Nov 09 '17 at 12:52
  • @HakariDo: what are the error messages?! – DRPK Nov 09 '17 at 12:54
  • ` get_data = make_object.check_files() first_file = get_size(self.path_1) + self.salt + get_last_write_time(self.path_1) st = os.stat(filename) FileNotFoundError: [WinError 2] The system cannot find the file specified:` – HakariDo Nov 09 '17 at 12:58
  • @HakariDo: come on, man!!! Did you check this line: CheckDuplicates("your_first_file_address", "your_second_file_address", your_salt)? You should replace those with your own file addresses!! – DRPK Nov 09 '17 at 13:00
  • The question was how I can check duplicates in a directory, so where am I entering the path to that directory? So far I assumed the script would work in the cwd. It is not clear what two files you are asking for. – HakariDo Nov 09 '17 at 13:03
  • Can you just make it simple and work in the cwd for now? – HakariDo Nov 09 '17 at 13:05
  • @HakariDo: in that case please update/edit your question, because I cannot find the sentence "check duplicates in a directory" in your question, and then I will answer you. – DRPK Nov 09 '17 at 13:10
  • What is your code doing? Does it compare two files to see if they are dups? Find dups in a directory? Something else? – HakariDo Nov 09 '17 at 13:15
  • @HakariDo: it compares two files; if they are duplicates it returns [file1 hash, file2 hash, "DUPLICATE"], and if they are not it returns [file1 hash, file2 hash, "DIFFERENT"]. That hash is a personal hash (a mix of LWT and size). – DRPK Nov 09 '17 at 13:22
  • I think there was a miscommunication. The question was aimed at how to find duplicate files very quickly from a set of files (e.g., from a folder or folders), not at checking whether two stand-alone files are dups or not. The upvote is done, but I am still waiting for the right answer. – HakariDo Nov 09 '17 at 13:39
  • @HakariDo: hi again, try this one! ... it will check duplicates in a directory. – DRPK Nov 09 '17 at 14:06
  • @HakariDo: did you check that? – DRPK Nov 09 '17 at 14:15
  • OK, now the code runs smoothly. But doesn't it calculate the md5 for every file? Is there any way it can export all duplicate files into a duplicates.txt in the CWD? – HakariDo Nov 09 '17 at 19:34
  • @HakariDo: man, it does not calc md5!!! It is a personal hash with LWT and file_size! You want to save dupi files in a text file? OK, I will update it. – DRPK Nov 10 '17 at 06:48
  • What do you mean by "personal hash"? – HakariDo Nov 10 '17 at 06:58
  • @HakariDo: updated, just try it again. – DRPK Nov 10 '17 at 07:04
  • Can you explain what this code does? I see md5; are you calculating it? Also, in the output text, can you please list each duplicate file on a new line? – HakariDo Nov 10 '17 at 07:11
  • @HakariDo: it does not calc the md5 of each file; it calcs the md5 of "LWT + FILE_SIZE" for each file (a string md5, not a file md5, which is much faster and more efficient; see the sketch after this thread)! I recommend you do not save dupi files one per line! Why? If you have four dupi files and another four dupi files, those are two lists; if you save them one per line, how could you detect which file is a dup of which? For example, A and B are dups, and C and D are another dup pair, different from A and B. If you save them one per line, how do you tell whether A and B are dups or B and C are dups?! – DRPK Nov 10 '17 at 07:40
  • @HakariDo: you should save each list on its own line, [A, B] on the first line and [C, D] on the second line; then you can tell them apart... – DRPK Nov 10 '17 at 07:40
  • "It calcs the md5 of 'LWT + FILE_SIZE' for each file": why is calculating the md5 of the LWT and size better than simply comparing the `[LWTsize]` digits? You are actually eliminating information content by doing that. **The ideal program would do this:** compare each file's [LWTsize] within a dir tree, **then**, for only those files whose [LWTsize] is identical, calculate the true file md5 or compare byte-by-byte. This would be very useful, and not far from your code. If you do it, you are accepted as the best answer. – HakariDo Nov 10 '17 at 07:50
  • @HakariDo: "then for only those files whose [LWTsize] is identical calculate the true file md5 or byte-by-byte": you mean if they are the same, I should compare their md5s, right? So... how should I save that result? Each file on a new line, or each list on a new line? – DRPK Nov 10 '17 at 08:05
  • Let's make it simpler: 1. get only the size for all files. 2. compare sizes. 3. eliminate all files with a unique size. 4. on the remaining set of files (the ones whose size occurs more than once = possible dups), do an md5 or byte-by-byte comparison. 5. print out the dup paths, one per line. I think that would be very handy. – HakariDo Nov 10 '17 at 08:20
  • @HakariDo: done. Try it again... – DRPK Nov 10 '17 at 11:14
  • Can you please make the output line by line (every duplicate fullpath/file.ext on a **new line**)? I need to compare your result to my other (slow) duplicate finder programs. Also, can you explain the logic behind this? Does it compare sizes, then md5? Thank you! – HakariDo Nov 10 '17 at 11:21
  • @HakariDo: OK, I will update it again! Yes, it compares sizes, then md5... (I did not see this line: "print out the dup paths as a new line each.") – DRPK Nov 10 '17 at 11:25
  • @HakariDo: Done. Check it... – DRPK Nov 10 '17 at 11:30
  • @HakariDo: is it ok? – DRPK Nov 10 '17 at 11:39
  • Looks very nice on initial testing, thank you very much! – HakariDo Nov 10 '17 at 11:41
  • DRPK: if you have time, here is another Q; you seem to be good at this: [https://stackoverflow.com/questions/47211331](https://stackoverflow.com/questions/47211331/os-rename-per-string-in-dir-name-and-fileextension-lookup-table) – HakariDo Nov 10 '17 at 11:47
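
For reference, the "personal hash" mentioned in the thread above is simply an md5 of the file's metadata string (mtime plus size), not of its contents; a minimal sketch of that idea (the function name personal_hash is illustrative):

import os, hashlib

def personal_hash(path):
    # md5 of "LWT + file size" as text: cheap, because no file contents are read.
    st = os.stat(path)
    meta = str(st.st_mtime) + str(st.st_size)
    return hashlib.md5(meta.encode("utf-8")).hexdigest()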