3

I have two zip files. I want to see if everything (file name and each file's contents in the zip) are the same.

There is a similar question. But the answer does not support zip files.

Anyone has a good idea?

John Kugelman
  • 349,597
  • 67
  • 533
  • 578
haojie
  • 593
  • 1
  • 7
  • 19
  • 2
    Yoiu can compute SHA256 hash and compare – Epsi95 Mar 08 '21 at 04:23
  • It was global news when Chinese researches found identical hashes for two byte sequences that were 1 byte off. In other words, barring a blue-moon, using the SHA256 against the file is sufficient to know that the contents are identical. Or you could use a heavier hashing function. I would give a go with the solution you linked to. – Razzle Shazl Mar 08 '21 at 04:59
  • Here's a [link](https://crypto.stackexchange.com/questions/47809/why-havent-any-sha-256-collisions-been-found-yet) discussing why SHA256 collisions have not been found yet. Hopefully that discussion can convince you. – Razzle Shazl Mar 08 '21 at 05:02
  • *So, given ~2600 universe-lifespans, all the bitcoin miners together would have a good shot at finding data that shares a given SHA-256 hash? Am I interpreting+calculating that right?* – Razzle Shazl Mar 08 '21 at 05:05
  • 2
    Zip files that contain identical items can have different hashes, because the contents of a zip file depends on the order items are added to it. Is it acceptable for the test to have false negatives? That is it reports the files are different even though they have the same contents? – RootTwo Mar 08 '21 at 05:24
  • @jasonshu. Let's say you have a script that compares zipfiles and reports if they were the same. The script is always correct when it reports the zipfiles are the same, but is sometimes wrong when it reports the zips are different (a false negative). My comment/question was whether that kind of error is okay, or do you need to know absolutely whether the zipfile contents are identical. – RootTwo Mar 08 '21 at 06:05

3 Answers3

3

Here's my stab at it. It may be sufficient to just make sure the ZipFiles contain the same items and that the items have matching CRC32s. (What is the chance that two ZipFiles being compared have files with the same name and same CRC32 but are different files?) If that is good enough, omit the loop that compares the file contents.

from zipfile import ZipFile

BUFSIZE = 1024

def are_equivalent(filename1, filename2):
    """Compare two ZipFiles to see if they would expand into the same directory structure
    without actually extracting the files.
    """
    
    with ZipFile(filename1, 'r') as zip1, ZipFile(filename2, 'r') as zip2:
        
        # Index items in the ZipFiles by filename. For duplicate filenames, a later
        # item in the ZipFile will overwrite an ealier item; just like a later file
        # will overwrite an earlier file with the same name when extracting.
        zipinfo1 = {info.filename:info for info in zip1.infolist()}
        zipinfo2 = {info.filename:info for info in zip2.infolist()}
        
        # Do some simple checks first
        # Do the ZipFiles contain the same the files?
        if zipinfo1.keys() != zipinfo2.keys():
            return False
        
        # Do the files in the archives have the same CRCs? (This is a 32-bit CRC of the
        # uncompressed item. Is that good enough to confirm the files are the same?)
        if any(zipinfo1[name].CRC != zipinfo2[name].CRC for name in zipinfo1.keys()):
            return False
        
        # Skip/omit this loop if matching names and CRCs is good enough.
        # Open the corresponding files and compare them.
        for name in zipinfo1.keys():
            
            # 'ZipFile.open()' returns a ZipExtFile instance, which has a 'read()' method
            # that accepts a max number of bytes to read. In contrast, 'ZipFile.read()' reads
            # all the bytes at once.
            with zip1.open(zipinfo1[name]) as file1, zip2.open(zipinfo2[name]) as file2:
                
                while True:
                    buffer1 = file1.read(BUFSIZE)
                    buffer2 = file2.read(BUFSIZE)
                    
                    if buffer1 != buffer2:
                        return False
                    
                    if not buffer1:
                        break
                        
        return True
RootTwo
  • 4,288
  • 1
  • 11
  • 15
0

Seems zip will have different hashes even you zip identical items. We can divide it two parts: first is to unzip, second is to compare folders after unzip.

import os
import filecmp
import zipfile

def are_dir_trees_equal(dir1, dir2):

    dirs_cmp = filecmp.dircmp(dir1, dir2)
    if len(dirs_cmp.left_only)>0 or len(dirs_cmp.right_only)>0 or \
       len(dirs_cmp.funny_files)>0:
       return False
    (_, mismatch, errors) =  filecmp.cmpfiles(
        dir1, dir2, dirs_cmp.common_files, shallow=False)
    if len(mismatch)>0 or len(errors)>0:
        return False
    for common_dir in dirs_cmp.common_dirs:
        new_dir1 = os.path.join(dir1, common_dir)
        new_dir2 = os.path.join(dir2, common_dir)
        if not are_dir_trees_equal(new_dir1, new_dir2):
           return False
    return True


BASE_PATH = '/Users/Documents/'
model1 = os.path.join(BASE_PATH, 'test1.zip')
model2 = os.path.join(BASE_PATH, 'test2.zip')

with zipfile.ZipFile(model1,"r") as zip_ref:
    zip_ref.extractall(BASE_PATH)
with zipfile.ZipFile(model2,"r") as zip_ref:
    zip_ref.extractall(BASE_PATH)

folder1 = model1.split('.')[0]
folder2 = model2.split('.')[0]

is_equal = are_dir_trees_equal(folder1, folder2)
print(is_equal)
haojie
  • 593
  • 1
  • 7
  • 19
0

I tried using zipfile builtin module in python.

from zipfile import ZipFile

def compare(file1, file2):
    try:
        with ZipFile(file1, 'r') as file:
            f1 = str([x for x in file.infolist()])
        with ZipFile(file2, 'r') as file:
            f2 = str([x for x in file.infolist()])
        return f1 == f2
    except FileNotFoundError:
        return f"Either file at {file1} or {file2} does not exist"


f = compare(file1='storage/1.zip', file2='storage/2.zip')
print(f)

I don't know this is correct approach or not (if it is not please correct me)

KailasMM
  • 106
  • 1
  • 5