102

Possible Duplicates:
Finding duplicate files and removing them.
In Python, is there a concise way of comparing whether the contents of two text files are the same?

What is the easiest way to see if two files are the same content-wise in Python?

One thing I can do is md5 each file and compare. Is there a better way?
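For example, a rough sketch of that md5 approach (the helper name and chunk size here are just illustrative):

import hashlib

def md5_of_file(path, chunk_size=8192):
    # Hypothetical helper: read the file in chunks so large files don't have to fit in memory
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

same = md5_of_file('file1.txt') == md5_of_file('file2.txt')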

Benjamin Loison
  • 3,782
  • 4
  • 16
  • 33
Josh Gibson
  • 21,808
  • 28
  • 67
  • 63
  • 11
    I'm really unhappy with the answers this question has. The top answer makes it seem like `filecmp.cmp(a, b)` compares files **byte-by-byte**, which it **very much doesn't!** It just checks cached `os.stat()` signatures, which for me at least led to false positives. Only `filecmp.cmp(a, b, shallow=True)` does a true byte-by-byte comparison. – xjcl Oct 29 '20 at 09:28
  • 14
    @xjcl I think you mean `shallow=False` – kuzzooroo Jan 29 '21 at 05:00
  • @kuzzooroo yes, darn it! – xjcl Jan 30 '21 at 10:13

2 Answers

177

Yes, I think hashing the files would be the best way if you have to compare several files and store hashes for later comparison. Since hashes can collide, a byte-by-byte comparison may still be done, depending on the use case.

Generally a byte-by-byte comparison is sufficient and efficient, and the filecmp module already does this (plus other things).
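For reference, a hand-rolled byte-by-byte comparison could look roughly like this (the chunked-read helper below is my own sketch, not the filecmp implementation):

def files_equal(path1, path2, chunk_size=8192):
    # Compare two files chunk by chunk, stopping at the first difference
    with open(path1, 'rb') as f1, open(path2, 'rb') as f2:
        while True:
            b1 = f1.read(chunk_size)
            b2 = f2.read(chunk_size)
            if b1 != b2:
                return False
            if not b1:
                # Both reads returned empty, so both files ended with no difference found
                return True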

See http://docs.python.org/library/filecmp.html e.g.

>>> import filecmp
>>> filecmp.cmp('file1.txt', 'file1.txt')
True
>>> filecmp.cmp('file1.txt', 'file2.txt')
False

Note that by default, filecmp.cmp considers two files equal if their os.stat() signatures (type, size, modification time) match; it does not necessarily read their contents. To force a content comparison, pass the third parameter shallow=False.
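For example, continuing the interpreter session above and forcing a content comparison:

>>> filecmp.cmp('file1.txt', 'file2.txt', shallow=False)
False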

Speed consideration: usually, if only two files have to be compared, hashing them will be slower than an efficient byte-by-byte comparison. The code below tries to time the hash vs. byte-by-byte approaches.

Disclaimer: this is not the best way of timing or comparing two algorithms and it needs improvement, but it does give a rough idea. If you think it should be improved, tell me and I will change it.

import random
import string
import hashlib
import time

def getRandText(N):
    # Build a random string of N printable characters
    return "".join(random.choice(string.printable) for i in range(N))

N = 1000000
randText1 = getRandText(N)
randText2 = getRandText(N)

def cmpHash(text1, text2):
    # hashlib works on bytes, so encode the strings before hashing
    hash1 = hashlib.md5()
    hash1.update(text1.encode())
    hash1 = hash1.hexdigest()

    hash2 = hashlib.md5()
    hash2.update(text2.encode())
    hash2 = hash2.hexdigest()

    return hash1 == hash2

def cmpByteByByte(text1, text2):
    # Plain string comparison; the byte-level work happens in C
    return text1 == text2

for cmpFunc in (cmpHash, cmpByteByByte):
    st = time.time()
    for i in range(10):
        cmpFunc(randText1, randText2)
    print(cmpFunc.__name__, time.time() - st)

and the output is

cmpHash 0.234999895096
cmpByteByByte 0.0
Michael Mior
  • 28,107
  • 9
  • 89
  • 113
Anurag Uniyal
  • 85,954
  • 40
  • 175
  • 219
  • 18
    No reason to do an expensive hash when a simple byte-by-byte comparison will work. +1 for filecmp – John Kugelman Jul 02 '09 at 04:58
  • 19
    If you have many huge files there's no reason to do an expensive byte-by-byte comparison when a simple hash calculation will work. – Vinko Vrsalovic Jul 02 '09 at 05:01
  • yes agree, unless we have to compare N files with each other, can filecmp work there or be faster than hash? – Anurag Uniyal Jul 02 '09 at 05:01
  • 4
    @vinko usually hash should be slower than byte-by-byte cmp, but as byte-by-byte cmp will be in a Python for loop I think it will be slower, as is the case with the filecmp implementation – Anurag Uniyal Jul 02 '09 at 05:02
  • 5
    Well, for a realistic test, one where the benefits of hashing for this purpose show, you should compare a single (same) 'file' to many different files, not just single pairs. In case I wasn't clear before: of course I agree that for the case where you will compare each file to only one other file byte-by-byte comparison will be faster (after all you have to read the whole file and make calculations to get a hash), things start to change when you want to compare one file to many other files, where the cost of calculating the hashes gets compensated by the number of comparisons. – Vinko Vrsalovic Jul 02 '09 at 05:42
  • 1
    yes I agree and if you read my answer first line is "hashing the file would be the best way if you have to compare several files and store hashes for later comparison" and my first comment above also says so – Anurag Uniyal Jul 02 '09 at 05:57
  • 5
    doesn't `filecmp.cmp(f1,f2)` by default only compare the stat of two files, not their actual bytes? Unless I'm mistaken, I don't think that's the desired behavior [filecmp](https://docs.python.org/2/library/filecmp.html) – Edward Newell Jul 10 '14 at 17:13
  • 2
    @nosklo If you're worried about hash collisions, get asteroid insurance. – Edward Newell Jul 10 '14 at 17:14
  • @nosklo If you're comparing just two files with a sha1 hash then I don't think you need to worry about that. It's only about 10 times as likely as being struck by lightning AND your house being crushed by a meteor (<1e-18). For many millions of files it does become a problem due to the birthday paradox. – Mark Feb 03 '16 at 14:06
  • I cannot see how the files are compared! But in general, if comparing two flat directories (this can be files in a directory with itself), isn't it best to list the files according to size, and then for files that have the same size… – user96265 Jan 19 '20 at 00:58
  • @EdwardNewell Yes, filecmp is a shallow comparison by default. filecmp.cmp(filename1, filename2, shallow=False) is necessary in most cases. – dstromberg Jan 28 '21 at 04:38
  • @EdwardNewell: Yes, but that's when `shallow==True`. But even in the case of `shallow==False`, first a few selected properties of stat are checked and the function returns `True` if it matches. So yes, blindly using `filecmp` is not recommended. – Nav Mar 06 '21 at 12:48
  • Should we use `shallow=False` or `shallow=True`? – alper Sep 28 '21 at 23:37
7

I'm not sure if you want to find duplicate files or just compare two single files. If the latter, the above approach (filecmp) is better; if the former, the following approach is better.

There are lots of duplicate files detection questions here. Assuming they are not very small and that performance is important, you can

  • Compare file sizes first, discarding all that don't match
  • If the file sizes match, compare using the biggest hash you can handle, hashing chunks of the files to avoid reading each whole big file (see the sketch below)
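A rough sketch of that two-step filter, assuming the file paths are already collected (the helper name, chunk size, and choice of SHA-1 are mine, not from the linked answer):

import os
import hashlib
from collections import defaultdict

def possible_duplicates(paths, chunk_size=1024 * 1024):
    # Step 1: group by size; a file with a unique size cannot have a duplicate
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    # Step 2: for files sharing a size, group by the hash of their first chunk
    groups = []
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue
        by_hash = defaultdict(list)
        for p in candidates:
            with open(p, 'rb') as f:
                by_hash[hashlib.sha1(f.read(chunk_size)).hexdigest()].append(p)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups  # each group holds paths that are likely duplicates

Files that land in the same group can then be confirmed with a full byte-by-byte comparison, since a matching prefix hash is only a strong hint.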

Here's an answer with Python implementations (I prefer the one by nosklo, BTW).

Community
  • 1
  • 1
Vinko Vrsalovic
  • 330,807
  • 53
  • 334
  • 373
  • File sizes may differ if there is an additional newline or space at the end of one of the compared files, even when their contents are otherwise the same – alper Mar 11 '22 at 11:43