2

I am very new to Python and have a question. How can I check in Python whether a string and a file have the same content? I need to download some files and rename them, but I don't want to save the same content under two or more different names (the same content can be hosted at different IP addresses).

Damir
  • What do you mean by same content? Are you trying to see if a string like `Hello World!` is the content of a file like `some_file.txt`? Are the files that you are working with very large? – Devin M Aug 05 '11 at 23:28
  • @Devin M Not very large, up to 10 kB. – Damir Aug 05 '11 at 23:30

6 Answers

5

If the file is large, I would consider reading it in chunks like this:

compare.py:

import hashlib

teststr = "foo"
filename = "file.txt"

def md5_for_file(f, block_size=2**20):
    """Hash a file's contents in fixed-size chunks to limit memory use."""
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()


# hash the string, including the trailing newline the file will contain
md5 = hashlib.md5()
md5.update((teststr + "\n").encode('utf8'))
digest = md5.digest()

# open in binary mode so the chunks are bytes and need no re-encoding
with open(filename, 'rb') as f:
    print(md5_for_file(f) == digest)

file.txt:

foo

This program prints `True` if the string and the file match.

Baversjo
5

Use the SHA-1 hash of the file content.

#!/usr/bin/env python
from __future__ import with_statement
from __future__ import print_function

from hashlib import sha1

def shafile(filename):
    with open(filename, "rb") as f:
        return sha1(f.read()).hexdigest()

if __name__ == '__main__':
    import sys
    import glob
    globber = (filename for arg in sys.argv[1:] for filename in glob.glob(arg))
    for filename in globber:
        print(filename, shafile(filename))

This program takes wildcards on the command line, but it is just for demonstration purposes.
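Since the question asks about comparing a string to a file, the same approach can hash both sides. A minimal sketch (the helper names here are illustrative, not from the answer above):

```python
from hashlib import sha1

def sha_of_string(s):
    # hash the UTF-8 encoding of the string
    return sha1(s.encode("utf8")).hexdigest()

def sha_of_file(filename):
    with open(filename, "rb") as f:
        return sha1(f.read()).hexdigest()
```

The string and the file have the same content exactly when the two hex digests are equal.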

hughdbrown
    I would look at the filesize first. – MRAB Aug 06 '11 at 00:34
  • _Use a checksum_, a hash is overkill. See my answer. – agf Aug 06 '11 at 02:00
  • @agf Re: "Use a checksum, a hash is overkill" I have no idea what this means. Overkill in the sense that it is onerous to program? Clearly, it is simple to program. Overkill in the sense that it takes too much time? There is no way you would notice the difference. Moreover, OP was not specific about time constraints. And it is probably fast enough: I can hash a 20MB file in 0.2 seconds on my laptop. – hughdbrown Aug 06 '11 at 13:06
  • Do you use a `dict` when you only need a `set`? It will do the job, but it's just good practice to use a `set` if that's all you need, even if you aren't worried about performance. This is the same situation. – agf Aug 06 '11 at 13:14
  • "Overkill" is "That's too much!" but with reference to some resource that you want to save, like execution time or programming time or memory or bandwidth. Using a dict when you need a set is wrong for a totally different reason: it does not support the operations that make sets preferable, like set-intersection or set-exclusion. If there is some resource that sha-1 hashes make too liberal use of, please name it. – hughdbrown Aug 08 '11 at 21:35
2

It is not necessary to use a cryptographic hash if all you want is a checksum. Python has a checksum function in the binascii module.

binascii.crc32(data[, crc])
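For instance, a string-versus-file comparison with `binascii.crc32` might look like this (the function names are illustrative, not from the answer above):

```python
import binascii

def checksum(data):
    # mask to an unsigned 32-bit value for a consistent result
    # (crc32 returned a signed int on Python 2)
    return binascii.crc32(data) & 0xffffffff

def same_content(text, filename):
    """Compare a string's CRC-32 checksum against a file's."""
    with open(filename, "rb") as f:
        return checksum(text.encode("utf8")) == checksum(f.read())
```

Note that CRC-32 is fine for spotting accidental duplicates, but unlike SHA-1 it offers no resistance to deliberate collisions.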
agf
1

Hashes and checksums are great for comparing a list of files, but if you are only comparing two specific files and don't have a pre-computed hash or checksum for either, it is faster to compare the two files directly than to compute a hash or checksum of each and then compare those.

def equalsFile(firstFile, secondFile, blocksize=65536):
    buf1 = firstFile.read(blocksize)
    buf2 = secondFile.read(blocksize)
    while len(buf1) > 0:
        if buf1 != buf2:
            return False
        buf1, buf2 = firstFile.read(blocksize), secondFile.read(blocksize)
    # the first file is exhausted; the files match only if the second is too
    return len(buf2) == 0

In my tests, 64 md5 checks on two 50MB files complete in 24.468 seconds, while 64 direct comparisons complete in just 4.770 seconds. This method also has the advantage of instantly returning false upon finding any difference, while calculating the hash must continue to read the entire file.

An additional early-fail test for files that aren't identical is to compare their sizes with os.path.getsize(filename) before running the byte comparison above. A size difference is very common between two files with different content, so this should always be the first thing you check.

import os

if os.path.getsize('file1.txt') != os.path.getsize('file2.txt'):
    print(False)
else:
    with open('file1.txt', 'rb') as f1, open('file2.txt', 'rb') as f2:
        print(equalsFile(f1, f2))
Hawkwing
1

The best way is to compute a hash (e.g. MD5) of each and compare the hashes.

Here you can read how to get the MD5 of a file.
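The linked answer is not reproduced here; as a rough sketch of the usual technique, reading in chunks keeps memory use bounded even for large files:

```python
import hashlib

def md5_of_file(filename, chunk_size=8192):
    md5 = hashlib.md5()
    with open(filename, "rb") as f:
        # iter() keeps calling f.read(chunk_size) until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```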

eugene_che
1

For each file you download, compute a hash or a checksum, and keep a list of these hashes/checksums.

Then, before saving the downloaded data to disk, check whether its hash/checksum already exists in the list: if it does, don't save the data; if it doesn't, save the file and add the checksum/hash to the list.

Pseudocode:

checksums = []
for url in all_urls:
    data = download_file(url)
    checksum = make_checksum(data)
    if checksum not in checksums:
         save_to_file(data)
         checksums.append(checksum)
Lennart Regebro