
I'm trying to write a Python script that sorts through files (photos, videos), checks the metadata of each, and finds and moves all duplicates to a separate directory. I got stuck on the metadata-checking part. I tried os.stat - comparing its results doesn't return True for duplicate files. Ideally, I should be able to do something like:

if os.stat("original.jpg") == os.stat("duplicate.jpg"):
    shutil.copy("duplicate.jpg", "C:\\Duplicate Folder")

Pointers anyone?

Raydel Miranda
La Alquimista
  • Would it be enough to use [hashlib](https://docs.python.org/3/library/hashlib.html)? – El Bert Sep 01 '14 at 14:36
  • _"checking metadata of each"_ What exactly are "duplicates" for you? Same content? Or same content and same metadata (which ones?) – Sylvain Leroux Sep 01 '14 at 14:39
  • Duplicates would be files with the same content, so I assumed they would also have the same metadata (in all fields). I might be wrong. My OS is Windows 7 Home Basic – La Alquimista Sep 01 '14 at 16:23
  • Take a look at the `filecmp` module in the standard library. It should do what you want. – Blckknght Sep 01 '14 at 17:22

4 Answers


There are a few things you can do. You can compare the contents or a hash of each file, or you can check a few select properties from the `os.stat` result, e.g.:

import os

def is_duplicate(file1, file2):
    # Treat files with the same size and the same modification time as duplicates
    stat1, stat2 = os.stat(file1), os.stat(file2)
    return stat1.st_size == stat2.st_size and stat1.st_mtime == stat2.st_mtime
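
If you want to compare the actual contents instead, a minimal sketch using the standard library's `filecmp` module (mentioned in the comments above, not part of this answer's original code):

import filecmp

# shallow=False forces a byte-by-byte comparison of the contents
# instead of trusting the os.stat signature alone
if filecmp.cmp("original.jpg", "duplicate.jpg", shallow=False):
    print("files are duplicates")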
user2682863

A basic loop using a set to keep track of already encountered files:

import glob
import hashlib

uniq = set()
for fname in glob.glob('*.txt'):
    with open(fname, "rb") as f:
        sig = hashlib.sha256(f.read()).digest()
        if sig not in uniq:
            uniq.add(sig)
            print(fname)
        else:
            print(fname, "(duplicate)")

Please note that, as with any hash function, there is a slight chance of collision, i.e. two different files having the same digest. Depending on your needs, this may or may not be acceptable.

According to Thomas Pornin in another answer:

"For instance, with SHA-256 (n=256) and one billion messages (p=109) then the probability [of collision] is about 4.3*10-60."


Given your needs, if you have to check additional properties in order to identify "true" duplicates, change the `sig = ...` line to whatever suits you. For example, if you need to check for "same content" and "same owner" (`st_uid` as returned by `os.stat()`), write:

    sig = (hashlib.sha256(f.read()).digest(),
           os.stat(fname).st_uid)
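
Putting this together with the original goal of moving duplicates into a separate directory, a minimal sketch (the C:\\Duplicate Folder destination is taken from the question; the '*.jpg' pattern is a placeholder, adjust it to match your photos and videos):

import glob
import hashlib
import os
import shutil

dup_dir = "C:\\Duplicate Folder"  # destination folder from the question

uniq = set()
for fname in glob.glob('*.jpg'):
    with open(fname, "rb") as f:
        sig = hashlib.sha256(f.read()).digest()
    if sig in uniq:
        # Already seen this content: move the duplicate out of the way
        shutil.move(fname, os.path.join(dup_dir, os.path.basename(fname)))
    else:
        uniq.add(sig)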
Sylvain Leroux

If two files have the same md5 they are exact duplicates.

from hashlib import md5

# Binary mode ("rb") is needed so the raw bytes can be hashed directly
with open(file1, "rb") as original:
    original_md5 = md5(original.read()).hexdigest()
    with open(file2, "rb") as duplicate:
        duplicate_md5 = md5(duplicate.read()).hexdigest()
        if original_md5 == duplicate_md5:
            do_stuff()

In your example you're using jpg files, which is why open is called with its second argument equal to "rb" (binary mode) rather than the default text mode. For details, see the documentation for open.
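
One caveat with the snippet above: f.read() loads the entire file into memory, which can be a problem for large videos. A minimal sketch of hashing in fixed-size chunks instead (the 8192-byte chunk size is an arbitrary choice):

from hashlib import md5

def file_md5(path, chunk_size=8192):
    # Feed the hash in chunks so large files never sit fully in memory
    h = md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()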

El Bert
  • “If two files have the same `md5` they are exact duplicates.” [Demonstrably false.](http://th.informatik.uni-mannheim.de/people/lucks/HashCollisions/) – icktoofay Sep 02 '14 at 00:13

os.stat offers information about a file's metadata and attributes, including the creation time. That is not a good approach for finding out whether two files are the same.

For instance: two files can have identical content but different creation times, so comparing stats will fail here. Sylvain Leroux's approach is the best one when combining performance and accuracy, since it is very rare for two different files to have the same hash.

So, unless you have an incredibly large amount of data and a repeated file would cause a system fatality, this is the way to go.

If that is your case (it doesn't seem to be), well... the only way you can be 100% sure two files are the same is to iterate and perform a byte-by-byte comparison, as in the sketch below.
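
A minimal sketch of such a byte-by-byte comparison, chunked so that large files don't have to fit in memory (the chunk size is an arbitrary choice):

import os

def same_bytes(file1, file2, chunk_size=8192):
    # Different sizes can never be identical, so check that cheaply first
    if os.stat(file1).st_size != os.stat(file2).st_size:
        return False
    with open(file1, "rb") as f1, open(file2, "rb") as f2:
        while True:
            chunk1, chunk2 = f1.read(chunk_size), f2.read(chunk_size)
            if chunk1 != chunk2:
                return False
            if not chunk1:  # both files exhausted simultaneously
                return True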

Raydel Miranda