
I'm trying to check whether some files on drive A are present or missing on another drive B, so I wrote this little script:

import os
import subprocess

# drive A
SOURCE_PATH = '/media/username/8e223d5b-2755-4e9f-a2f6-fac5e762e836/username'
# drive B
DESTINY_PATH = '/home/username/'
SUCCESS_CODE = 0


if __name__ == '__main__':
    for source_actualdir, source_subdir, source_dirFiles in os.walk(SOURCE_PATH):
        for source_filename in source_dirFiles:
            source_file = os.path.join(source_actualdir, source_filename)
            found = False
            for destiny_actualdir, destiny_subdir, destiny_dirFiles in os.walk(DESTINY_PATH):
                for destiny_filename in destiny_dirFiles:
                    destiny_file = os.path.join(destiny_actualdir, destiny_filename)
                    # diff exits with code 0 when the files are identical
                    response = subprocess.run(['diff', '-s', source_file, destiny_file], capture_output=True)
                    if response.returncode == SUCCESS_CODE:
                        print(f'Match {source_file} == {destiny_file}')
                        found = True
                        break
                if found:
                    break
            if not found:
                print(f'File {source_file} is missing in {DESTINY_PATH}')

But I find it far too slow when running it (I have to check 242603 files for a total of 145 GB), and I'd like to speed it up but I don't know how.

What can I use for such a task?

madtyn
  • What you're doing is for every file in the source, search every file in the dest, and then diff them. What might be quicker is building a `set` of all files in both directories, and then checking overlap. – blueteeth Sep 12 '22 at 16:38
  • 1
    Tangentially, the code you put inside `if __name__ == "__main__":` should be absolutely trivial. The condition is only useful when you `import` this code; if all the useful functionality is excluded when you `import`, you will never want to do that anyway. See also https://stackoverflow.com/a/69778466/874188 – tripleee Sep 12 '22 at 16:58
  • A common approach is to obtain checksums for all the files; the ones which exist in both sets are duplicates. Depending on your desired precision, MD5 might be good enough, but it is more prone to collisions than SHA1 or SHA256. – tripleee Sep 12 '22 at 17:02
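A minimal sketch of that checksum idea, using hashlib from the standard library (SHA256, as suggested) and reading in chunks so a large file is never loaded into memory at once; the chunk size is an arbitrary choice:

import hashlib

def sha256_of(path, chunk_size=1 << 20):
    # Stream the file through SHA256 in 1 MiB chunks
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

Two files with the same digest can then be treated as identical copies.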

1 Answer


To expand on my comment, say drive A contains foo, bar, bat, and drive B contains fizz, buzz, bat. Then you're doing these checks:

foo, fizz
foo, buzz
foo, bat
bar, fizz
bar, buzz
bar, bat
bat, fizz
bat, buzz
bat, bat => success

So the number of checks you have to do is len(A) * len(B) (9 in this example).

Whereas, say you had

A = {"foo", "bar", "bat"}
B = {"fizz", "buzz", "bat"}

You could loop through A and check whether each element is in B. Then you're only looping through one collection, and each membership test on a set is O(1) on average; loop over the smaller of the two if their sizes differ significantly.
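For instance, with those sets (a toy illustration of the membership check):

A = {"foo", "bar", "bat"}
B = {"fizz", "buzz", "bat"}

for name in A:
    if name in B:
        print(f"{name} is on both drives")

# Or get the missing ones directly with a set difference
missing = A - B
print(missing)  # {'foo', 'bar'} (order may vary)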

Then you've got the problem of how to check whether the files are the same. I would use MurmurHash; the mmh3 package provides Python bindings.

So instead of diffing everything, you calculate each file's hash as you loop through, and store it against the path (in a dict). Then you check the overlap as before, but now a match tells you both that the content is the same and what the two paths are.

Something like:

import mmh3

def get_path_map(file_paths):
    d = {}
    for path in file_paths:
        # open in binary mode: these are arbitrary files, not necessarily text
        with open(path, 'rb') as f:
            hs = mmh3.hash(f.read())
        d[hs] = path
    return d

sources = get_path_map(...)  # you can do this bit
dests = get_path_map(...)

for hs, path in sources.items():
    if hs in dests:
        print(f"Match found: '{path}' is the same as '{dests[hs]}'")
blueteeth