I am having trouble writing a function that compares two zip files (to check whether they are actually the same, not only by name). Here is an example of my code:

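from ftplib import FTP
import filecmp

def validate_zip_files(self):
    host = '192.168.0.1'
    port = 2323
    username = '123'
    password = '123'
    ftp = FTP()
    ftp.connect(host, port)
    ftp.login(username, password)
    ftp.cwd('test')
    print(ftp.pwd())
    # Download the remote file and close it before comparing, so the data is flushed.
    with open('test.zip', 'wb') as local_file:
        ftp.retrbinary('RETR test', local_file.write)
    ftp.quit()
    # filecmp.cmp expects paths, not file objects (and reopening with 'wb' would
    # truncate both files); shallow=False compares the actual bytes.
    return filecmp.cmp('test.zip', '/home/user/file/text.zip', shallow=False)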

One of the problems is that the second zip file is in a different location ('/home/user/file/text.zip'), while I am downloading the first one into the directory where my Python script is. I am also not 100% sure that filecmp.cmp works with .zip files.

Any ideas would be great :) Thanks.

noonewin
  • Why don't you create a hash (`sha-256`, for example) of both files and compare these? – jhoepken Jun 24 '15 at 13:09
  • You seem to have figured out how to download a file via FTP, which reduces your problem to "how to compare two files", right? If that's the case, could you please change the title accordingly? – jhoepken Jun 24 '15 at 13:21

2 Answers

Rather than comparing the files directly, I would compare hashes of the files. This removes the dependency on filecmp, which, as you said, might not work with zipped files.

import hashlib

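def compare_files(a, b):
    # Hash the raw bytes of each file; equal digests mean the files are
    # byte-identical (barring a hash collision).
    with open(a, 'rb') as file_a, open(b, 'rb') as file_b:
        digest_a = hashlib.sha256(file_a.read()).digest()
        digest_b = hashlib.sha256(file_b.read()).digest()
    return digest_a == digest_b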
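
For large archives, reading each file fully into memory may be undesirable. Below is a minimal sketch of the same hash comparison that reads the files in chunks. The names `sha256_of_file` and `files_have_same_hash` and the 64 KiB chunk size are illustrative assumptions, not part of the code above.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        # iter() with a sentinel keeps calling read() until it returns b''.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def files_have_same_hash(path_a, path_b):
    """True when both files produce the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

Called as `files_have_same_hash('test.zip', '/home/user/file/text.zip')`, this should give the same answer as hashing the whole file at once, while keeping memory use bounded by the chunk size.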
jhoepken
  • Strictly speaking, `fileA == fileB` doesn't always imply that the two files are identical, because hash collisions are possible, though for sha256 the probability is very small ... – Kevin May 18 '22 at 15:18
  • Heads up: zip files include some metadata that might not match, even if the compressed content is identical. See this [link](https://stackoverflow.com/questions/9714139/why-does-zipping-the-same-content-twice-gives-two-files-with-different-sha1) – eretmochelys Mar 02 '23 at 20:02
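
To sidestep the metadata issue raised in the last comment, one option is to compare what is stored inside the archives instead of the archive bytes. Here is a rough sketch, assuming that matching entry names, uncompressed sizes, and CRC-32 values are good enough evidence of equal content. The function names are hypothetical, and archives with duplicate entry names are not handled.

```python
import zipfile


def zip_entry_summary(path):
    """Map each entry name to (uncompressed size, CRC-32) from the zip directory."""
    with zipfile.ZipFile(path) as archive:
        return {info.filename: (info.file_size, info.CRC)
                for info in archive.infolist()}


def zips_have_same_contents(path_a, path_b):
    """True when both archives list the same entries with equal size and CRC-32."""
    return zip_entry_summary(path_a) == zip_entry_summary(path_b)
```

This ignores timestamps, compression level, and entry order, and it does not decompress anything. CRC-32 is a weak checksum, so if accidental matches are a concern, the entries can be read and compared byte for byte instead.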

See my gist, which compares two zip files by their contents and generates a patch file from one zip to the other. For example, if the two zip files share an entry whose content differs, the gist will detect it; if one archive has entries that the other lacks, the gist detects that as well. The gist ignores differences in modification time. That said, if you only care about a shallow comparison, hashlib could be a better choice.

For your reference, code from the gist:

import os
import argparse
import collections
import tempfile
import zipfile
import filecmp
import shutil
import shlex

ZipCmpResult = collections.namedtuple('ZipCmpResult',
                                      ['to_rm', 'to_cmp', 'to_add'])


def make_parser():
    parser = argparse.ArgumentParser(
        description='Make patch zip file from two similar zip files.')
    parser.add_argument(
        '--oldfile',
        default=os.path.join('share', 'old.zip'),
        help='default: %(default)s')
    parser.add_argument(
        '--newfile',
        default=os.path.join('share', 'new.zip'),
        help='default: %(default)s')
    parser.add_argument(
        '--toname',
        default=os.path.join('share', 'patch'),
        help='default: %(default)s')
    return parser


def zipcmp(old, new):
    with zipfile.ZipFile(old) as zinfile:
        old_names = set(zinfile.namelist())
    with zipfile.ZipFile(new) as zinfile:
        new_names = set(zinfile.namelist())
    to_rm = old_names - new_names
    to_cmp = old_names & new_names
    to_add = new_names - old_names
    return ZipCmpResult(to_rm, to_cmp, to_add)


def compare_files(old, new, cmpresult):
    # Extract each shared entry from both archives and compare the bytes;
    # entries whose contents differ are queued for the patch via to_add.
    with tempfile.TemporaryDirectory() as tmpdir, \
         zipfile.ZipFile(old) as zinfile_old, \
         zipfile.ZipFile(new) as zinfile_new:
        old_dest = os.path.join(tmpdir, 'old')
        new_dest = os.path.join(tmpdir, 'new')
        os.mkdir(old_dest)
        os.mkdir(new_dest)
        for filename in cmpresult.to_cmp:
            # extract() returns the path it actually wrote, which can differ
            # from a plain join when the entry name had to be sanitized.
            old_path = zinfile_old.extract(filename, path=old_dest)
            new_path = zinfile_new.extract(filename, path=new_dest)
            if not filecmp.cmp(old_path, new_path, shallow=False):
                cmpresult.to_add.add(filename)


def mkpatch(new, cmpresult, to_name):
    with zipfile.ZipFile(new) as zinfile, \
         zipfile.ZipFile(to_name + '.zip', 'w') as zoutfile:
        for filename in cmpresult.to_add:
            with zinfile.open(filename) as infile, \
                 zoutfile.open(filename, 'w') as outfile:
                shutil.copyfileobj(infile, outfile)
    with open(to_name + '.sh', 'w', encoding='utf-8') as outfile:
        outfile.write('#!/bin/sh\n')
        for filename in cmpresult.to_rm:
            outfile.write('rm {}\n'.format(shlex.quote(filename)))


def main():
    args = make_parser().parse_args()
    cmpresult = zipcmp(args.oldfile, args.newfile)
    compare_files(args.oldfile, args.newfile, cmpresult)
    mkpatch(args.newfile, cmpresult, args.toname)


if __name__ == '__main__':
    main()
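
If the archives are small enough to read entries into memory, the same three-way split can be sketched without extracting anything to disk. This is only an illustration of the idea, not code from the gist; `diff_zip_entries` is a hypothetical name.

```python
import zipfile


def diff_zip_entries(old_path, new_path):
    """Return (removed, changed, added) sets of entry names between two zips."""
    with zipfile.ZipFile(old_path) as old_zip, \
         zipfile.ZipFile(new_path) as new_zip:
        old_names = set(old_zip.namelist())
        new_names = set(new_zip.namelist())
        # Entries present in both archives are compared byte for byte.
        changed = {name for name in old_names & new_names
                   if old_zip.read(name) != new_zip.read(name)}
    return old_names - new_names, changed, new_names - old_names
```

Two archives have the same contents when all three returned sets are empty. Timestamps and other per-entry metadata are ignored here too.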
Kevin