1

I have a program that recursively goes through 2 directories and puts the filename:sha256hash into 2 dicts, folder1 and folder2.

What I want to do is a comparison of the hashes and if the hashes match but the key is different, pub the key into a new list called "renamed". I have the logic in place to account for deleted files, new files, and files where the key is the same but the value(hash) is different (a modified file) but can't for the life of me get my head around doing the opposite.

    # Put filename:hash into 2 dictionaries from the folders to compare

    for root, dirs, files in os.walk(folder_1):
        for file in files:
            files1[file] = get_hash(os.path.join(root,file))

    for root, dirs, files in os.walk(folder_2):
        for file in files:
            files2[file] = get_hash(os.path.join(root, file))

    # Set up the operations to do for the comparison 

    set_files2, set_files1 = set(files2.keys()), set(files1.keys())
    intersect = set_files2.intersection(set_files1)

    # Compare and add to list for display

    created.extend(set_files2 - intersect)
    deleted.extend(set_files1 - intersect)
    modified.extend(set(k for k in intersect if files1[k] != files2[k]))
    unchanged.extend(set(k for k in intersect if files1[k] == files2[k]))

The issue with this is 1: it doesn't account for renamed files, 2: it puts renamed files into created, so once I have renamed files I have to created = created - renamed to filter those out of actual new files.

Any/all help is appreciated. I've come this far but for some reason my mind is on strike.

Steve
  • 29
  • 2
  • 2
    Why not just flip the dictionary (key hash and value filename)? To save time, just construct this opposite dictionary in the same time you build the original dictionary. – Bitwise Jun 20 '13 at 17:40
  • @Bitwise That'll save time at the cost of space, but I'm surprised I didn't think of that. – 2rs2ts Jun 20 '13 at 17:43
  • 2
    Your design will fail in the case of two files with the same name. It is possible for there to be, say, `myfolder1/file1` and `myfolder2/file1`, and you will consider them the same by doing `...[file] = get_hash(...)` (overwriting one or the other based on the random order you visited them). I am also confused why you are hashing. – ninjagecko Jun 20 '13 at 17:52
  • 1
    The solution to the problem @ninjagecko raises is to probably use the entire (relative) path i.e. `myfolder1/file1`. – 2rs2ts Jun 20 '13 at 18:01
  • @2rs2ts: I'm not sure that would work by itself, might have to make a few more changes. – ninjagecko Jun 20 '13 at 19:38

3 Answers3

3

You can flip your files1 and files2 dicts:

name_from_hash1 =  {v:k for k, v in file1.items()}
name_from_hash2 =  {v:k for k, v in file2.items()}

(The flipping idiom I found on this SO answer.)

Then,

renamed = []
for h in name_from_hash1:
    if h in name_from_hash2 and name_from_hash1[h] != name_from_hash2[h]:
        renamed.append(name_from_hash2[h])

renamed is then the list of renamed filenames by their current names. You can get the list of the original names of the renamed files by changing name_from_hash2 to name_from_hash in the last line.

Community
  • 1
  • 1
Alex Szatmary
  • 3,431
  • 3
  • 21
  • 30
  • 1
    It might save time and space to use only one dict, and check to see if the hash is already a key in the dict - if it is, append it to `renamed`. You would be iterating over smaller sets overall. – 2rs2ts Jun 20 '13 at 17:57
  • Thanks, I had searched around for something but I think I was looking for the wrong solution. This definitely fits the bill. Thanks so much, I actually hadn't considered just flipping things around. – Steve Jun 20 '13 at 18:16
2

I've got a simple solution for you: rather than having the filenames as keys and hashes as values, have the hashes as keys and filenames as values (after all, you want the keys to be unique, not the values). You'd simply have to adjust the rest of your program to account for that. (Oops, looks like Bitwise already mentioned that in a comment. Oh well.)

If you don't want to change the rest of your code, here's a good one-liner method to create a set of renamed files, if you're using Python 2.7+:

renamedfiles = {k for k, v in hashes1.items() if v in hashes2.values()}

For slightly increased efficiency in Python 2.7, use iteritems() and itervalues() instead (Python 3 represents its key, item, and value views as iterators by default).

Addendum: You could also do renamedfiles = filter(lambda item:item in hashes2.values(), hashes1.items()), though that would result in an iterator over the qualifying key/value pairs rather than a set or dict. Also, I believe comprehensions are generally preferred in Python even though filter() is one of the built-in methods.

JAB
  • 20,783
  • 6
  • 71
  • 80
  • Ah, better yet, use only one dict and check to see if the hash is already a key in the dict (`if k in d: renamed.append(k)`). – 2rs2ts Jun 20 '13 at 17:55
  • 1
    @2rs2ts That would work too. In fact, that's basically what my set comprehension does, except possibly more efficient due to not needing extra space allocated for the hash values as they aren't needed in the renamed set. – JAB Jun 20 '13 at 18:03
  • yeah you edited your post around the same tine as my comment, so it's cool that we were thinking along the same lines. :) – 2rs2ts Jun 20 '13 at 18:07
0

This is a bad polynomial time solution I came up with quickly, but:

>>> d1 = {'a':1, 'b':2, 'c':3}    
>>> d2 = {'a':1, 'b':3, 'c':2}
>>> for key1 in d1:
...     for key2 in d2:
...             if d1[key1] == d2[key2] and key1 != key2:
...                     print key1, key2
... 
c b
b c

This code prints the key in d2 for which its value is the same as a key in d1, but only if the two keys are different. Adapt this for how you meant to put the changed keys in your modified list.

2rs2ts
  • 10,662
  • 10
  • 51
  • 95