
I want to calculate the CRC of a file and get output like: E45A12AC. Here's my code:

#!/usr/bin/env python 
import os, sys
import zlib

def crc(fileName):
    fd = open(fileName,"rb")
    content = fd.readlines()
    fd.close()
    for eachLine in content:
        zlib.crc32(eachLine)

for eachFile in sys.argv[1:]:
    crc(eachFile)

This calculates the CRC for each line, but its output (e.g. -1767935985) is not what I want.

Hashlib works the way I want, but it computes the md5:

import hashlib
m = hashlib.md5()
for line in open('data.txt', 'rb'):
    m.update(line)
print m.hexdigest()

Is it possible to get something similar using zlib.crc32?

Jason Sundram
user203547

10 Answers


A slightly more compact and optimized version:

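import zlib

def crc(fileName):
    prev = 0
    # Feed each line into the running CRC; the with block closes the file.
    with open(fileName, "rb") as fd:
        for eachLine in fd:
            prev = zlib.crc32(eachLine, prev)
    return "%X" % (prev & 0xFFFFFFFF)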

PS2: The old PS is obsolete and has been deleted, following the suggestion in the comment. Thank you. I don't know how I missed this, but it was a really good suggestion.

Bastian
kobor42
    If you set `prev` to 0 instead then you don't need to worry about an exception. – Ignacio Vazquez-Abrams Mar 05 '10 at 15:52
    Something even faster which does result in the same output: def crc(filename): return "%X"%(zlib.crc32(open(filename,"rb").read()) & 0xFFFFFFFF) This reads the whole file into memory and calculates the CRC32. Granted, the bigger the file the more memory the program needs; depends on the trade-off you want, memory for speed, or speed for memory. – leetNightshade Nov 09 '12 at 00:35
    A way to speed up the calculation considerably (factor 2--3) while keeping the memory usage low is to read fixed size chunks instead of reading "lines" from the binary file. Added a separate answer for this. – CrouZ Sep 27 '19 at 20:39

A modified version of kobor42's answer, with performance improved by a factor 2-3 by reading fixed size chunks instead of "lines":

import zlib

def crc32(fileName):
    with open(fileName, 'rb') as fh:
        hash = 0
        while True:
            s = fh.read(65536)
            if not s:
                break
            hash = zlib.crc32(s, hash)
        return "%08X" % (hash & 0xFFFFFFFF)

Also includes leading zeroes in the returned string.
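
To make the padding difference concrete, here is a minimal sketch comparing the two format strings on a value whose top byte is zero. The value is made up for illustration and is not the CRC of any particular file:

```python
# Illustrative 32-bit value with a zero top byte (not a real file's CRC).
value = 0x00ABCDEF

short = "%X" % (value & 0xFFFFFFFF)     # no padding: leading zeroes are dropped
padded = "%08X" % (value & 0xFFFFFFFF)  # always 8 hex digits

print(short)   # ABCDEF
print(padded)  # 00ABCDEF
```

With "%X" the string length varies with the value, so "%08X" is the safer choice when a fixed-width result like E45A12AC is expected.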

CrouZ

hashlib-compatible interface for CRC-32 support:

import struct
import zlib

class crc32(object):
    name = 'crc32'
    digest_size = 4
    block_size = 1

    def __init__(self, arg=b''):
        self.__digest = 0
        self.update(arg)

    def copy(self):
        # Create a fresh instance and carry over the running checksum.
        copy = type(self)()
        copy.__digest = self.__digest
        return copy

    def digest(self):
        # hashlib digests are bytes, so pack the 32-bit value big-endian.
        return struct.pack('>I', self.__digest)

    def hexdigest(self):
        return '{:08x}'.format(self.__digest)

    def update(self, arg):
        self.__digest = zlib.crc32(arg, self.__digest) & 0xffffffff

# Now you can define hashlib.crc32 = crc32
import hashlib
hashlib.crc32 = crc32

# Python 2.7 only: hashlib.algorithms += ('crc32',)
# Python >= 3.2: hashlib.algorithms_available.add('crc32')
Laurent LAPORTE
Paulo Freitas

To show any integer's lowest 32 bits as 8 hexadecimal digits, without sign, you can "mask" the value by bit-and'ing it with a mask made of 32 bits all at value 1, then apply formatting. I.e.:

>>> x = -1767935985
>>> format(x & 0xFFFFFFFF, '08x')
'969f700f'

It's quite irrelevant whether the integer you are thus formatting comes from zlib.crc32 or any other computation whatsoever.
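
As a small sketch of the same idea, using the value from the question: for a negative 32-bit result, masking is equivalent to adding 2**32, and an uppercase format code gives the style of output the question asks for.

```python
x = -1767935985              # signed value from the question

unsigned = x & 0xFFFFFFFF    # keep only the low 32 bits
upper_hex = format(unsigned, '08X')

print(unsigned)                # 2527031311
print(unsigned == x + 2**32)   # True
print(upper_hex)               # 969F700F
```

On Python 3, zlib.crc32 already returns an unsigned value, so there the mask is a harmless no-op that keeps the code portable.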

Alex Martelli
    Good point re: formatting, but it looks like his code also doesn't compute what he wants it to. There are really two problems here: 1) Compute the CRC of a file. 2) Display the CRC value as hex. – Jason Sundram Mar 16 '12 at 15:49
  • Not only that, but format is slower than "%X"%(x & 0xFFFFFFFF), as used in kobor42's answer. But it was nice to see another way to do it; I've never used format before. – leetNightshade Nov 09 '12 at 00:32

Python 3.8+ (using the walrus operator):

import zlib

def crc32(filename, chunksize=65536):
    """Compute the CRC-32 checksum of the contents of the given filename"""
    with open(filename, "rb") as f:
        checksum = 0
        while chunk := f.read(chunksize):
            checksum = zlib.crc32(chunk, checksum)
        return checksum

chunksize is how many bytes to read from the file at a time. You will get the same CRC for the same file no matter what you set chunksize to (it has to be > 0), but setting it too low might make your code slow, too high might use too much memory.

The result is a 32 bit integer. The CRC-32 checksum of an empty file is 0.
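
The claim that the chunk size does not affect the result can be checked with a short sketch. It uses an in-memory stream instead of a real file, and the helper name crc32_stream is only for illustration:

```python
import io
import zlib

def crc32_stream(stream, chunksize):
    """Checksum a binary stream by feeding it to zlib.crc32 in chunks."""
    checksum = 0
    while True:
        chunk = stream.read(chunksize)
        if not chunk:
            break
        checksum = zlib.crc32(chunk, checksum)
    return checksum & 0xFFFFFFFF

data = b"hello world\n" * 1000
one_shot = zlib.crc32(data) & 0xFFFFFFFF

# Same data, very different chunk sizes.
results = {size: crc32_stream(io.BytesIO(data), size)
           for size in (1, 7, 4096, 65536)}
all_equal = all(r == one_shot for r in results.values())
empty = crc32_stream(io.BytesIO(b""), 4096)

print(all_equal)  # True
print(empty)      # 0
```

This works because passing the previous checksum as the second argument continues the computation exactly where the last chunk left off.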

Boris Verkhovskiy

Edited to include Altren's solution below.

A modified and more compact version of CrouZ's answer, with slightly improved performance, using a for loop and file buffering:

def forLoopCrc(fpath):
    """With for loop and buffer."""
    crc = 0
    with open(fpath, 'rb', 65536) as ins:
        for x in range(int((os.stat(fpath).st_size / 65536)) + 1):
            crc = zlib.crc32(ins.read(65536), crc)
    return '%08X' % (crc & 0xFFFFFFFF)

Results on a 6700K with an HDD:

(Note: Retested multiple times and it was consistently faster.)

Warming up the machine...
Finished.

Beginning tests...
File size: 90288KB
Test cycles: 500

With for loop and buffer.
Result 45.24728019630359 

CrouZ solution
Result 45.433838356097894 

kobor42 solution
Result 104.16215688703986 

Altren solution
Result 101.7247863946586  

Tested in Python 3.6.4 x64 using the script below:

import os, timeit, zlib, random, binascii

def forLoopCrc(fpath):
    """With for loop and buffer."""
    crc = 0
    with open(fpath, 'rb', 65536) as ins:
        for x in range(int((os.stat(fpath).st_size / 65536)) + 1):
            crc = zlib.crc32(ins.read(65536), crc)
    return '%08X' % (crc & 0xFFFFFFFF)

def crc32(fileName):
    """CrouZ solution"""
    with open(fileName, 'rb') as fh:
        hash = 0
        while True:
            s = fh.read(65536)
            if not s:
                break
            hash = zlib.crc32(s, hash)
        return "%08X" % (hash & 0xFFFFFFFF)

def crc(fileName):
    """kobor42 solution"""
    prev = 0
    for eachLine in open(fileName,"rb"):
        prev = zlib.crc32(eachLine, prev)
    return "%X"%(prev & 0xFFFFFFFF)

def crc32altren(filename):
    """Altren solution"""
    buf = open(filename,'rb').read()
    hash = binascii.crc32(buf) & 0xFFFFFFFF
    return "%08X" % hash

fpath = r'D:\test\test.dat'
tests = {forLoopCrc: 'With for loop and buffer.',
         crc32: 'CrouZ solution', crc: 'kobor42 solution',
         crc32altren: 'Altren solution'}
count = 500

# CPU, HDD warmup
randomItm = [x for x in tests.keys()]
random.shuffle(randomItm)
print('\nWarming up the machine...')
for c in range(count):
    randomItm[0](fpath)
print('Finished.\n')

# Begin test
print('Beginning tests...\nFile size: %dKB\nTest cycles: %d\n' % (
    os.stat(fpath).st_size/1024, count))
for x in tests:
    print(tests[x])
    start_time = timeit.default_timer()
    for c in range(count):
        x(fpath)
    print('Result', timeit.default_timer() - start_time, '\n')

It is marginally faster than CrouZ's solution (about 0.4% in this run) because for loops are faster than while loops (sources: here and here).
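
As a variation that was not part of the benchmark above (so no timing claim is made for it), a for loop can also be driven without looking up the file size, by using the two-argument form of iter(), which keeps calling the read function until it returns an empty bytes object. The function name iter_crc and the temporary test file are just for illustration:

```python
import functools
import os
import tempfile
import zlib

def iter_crc(fpath, chunksize=65536):
    """CRC-32 of a file, read in chunks via iter(callable, sentinel)."""
    crc = 0
    with open(fpath, 'rb') as ins:
        # iter() keeps calling ins.read(chunksize) until it returns b''.
        for chunk in iter(functools.partial(ins.read, chunksize), b''):
            crc = zlib.crc32(chunk, crc)
    return '%08X' % (crc & 0xFFFFFFFF)

# Quick self-check against a one-shot CRC of the same bytes.
payload = os.urandom(200000)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    result = iter_crc(tmp.name)
    expected = '%08X' % (zlib.crc32(payload) & 0xFFFFFFFF)
    print(result == expected)  # True
finally:
    os.remove(tmp.name)
```

Unlike the range-based loop, this does not issue a final read on an already exhausted file when the size is an exact multiple of the chunk size, and it does not depend on os.stat.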

Polemos

There is a faster and more compact way to compute the CRC using binascii:

import binascii

def crc32(filename):
    # Read the whole file into memory; the with block closes it afterwards.
    with open(filename, 'rb') as f:
        buf = f.read()
    hash = binascii.crc32(buf) & 0xFFFFFFFF
    return "%08X" % hash
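
For what it's worth, binascii.crc32 and zlib.crc32 compute the same checksum, so the choice between them is largely a matter of taste. A quick check, using the conventional CRC-32 test input "123456789":

```python
import binascii
import zlib

data = b"123456789"  # conventional CRC-32 check input

via_binascii = binascii.crc32(data) & 0xFFFFFFFF
via_zlib = zlib.crc32(data) & 0xFFFFFFFF

print(via_binascii == via_zlib)  # True
print("%08X" % via_binascii)     # CBF43926
```

Both functions also accept a starting value as a second argument, so either one can be used for chunked reading.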
Altren

Merging the two code snippets above gives:

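import logging
import zlib

# Wrapped in a function so that `return` is valid; the name crc is illustrative.
def crc(decompressedFile):
    try:
        fd = open(decompressedFile, "rb")
    except IOError:
        logging.error("Unable to open the file in read mode: " + decompressedFile)
        return 4
    eachLine = fd.readline()
    prev = 0
    while eachLine:
        prev = zlib.crc32(eachLine, prev)
        eachLine = fd.readline()
    fd.close()
    # Returning the formatted CRC is an addition; the snippet stopped after close().
    return "%08X" % (prev & 0xFFFFFFFF)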
sunsys

You can use base64 to get output like [ERD45FTR]. zlib.crc32 also accepts a previous value as its second argument, so the checksum can be updated incrementally.

import os, sys
import zlib
import base64

def crc(fileName):
    fd = open(fileName,"rb")
    content = fd.readlines()
    fd.close()
    prev = None
    for eachLine in content:
        if not prev:
            prev = zlib.crc32(eachLine)
        else:
            prev = zlib.crc32(eachLine, prev)
    return prev

for eachFile in sys.argv[1:]:
    print base64.b64encode(str(crc(eachFile)))
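
Note that the code above base64-encodes the decimal string of the CRC. If a short fixed-length token is the goal (an assumption about the intent, not something the answer states), one alternative is to pack the 32-bit value into four raw bytes first. The helper name crc_base64 is hypothetical:

```python
import base64
import struct
import zlib

def crc_base64(data):
    """Base64 text of the big-endian 4-byte CRC-32 of data."""
    value = zlib.crc32(data) & 0xFFFFFFFF
    packed = struct.pack('>I', value)  # 4 raw bytes, most significant first
    return base64.b64encode(packed).decode('ascii')

token = crc_base64(b"123456789")
print(len(token))  # 8 (6 characters plus "==" padding)
```

Four bytes always encode to eight base64 characters, so the token length is fixed regardless of the CRC value.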

bhups

solution:

import os, sys
import zlib

def crc(fileName, excludeLine="", includeLine=""):
    try:
        fd = open(fileName, "rb")
    except IOError:
        print("Unable to open the file in read mode: %s" % fileName)
        return
    prev = 0
    for eachLine in fd:
        # Skip excluded lines; the for loop still advances to the next line.
        if excludeLine and eachLine.startswith(excludeLine):
            continue
        prev = zlib.crc32(eachLine, prev)
    fd.close()
    return format(prev & 0xFFFFFFFF, '08x')  # returns 8 digits crc

for eachFile in sys.argv[1:]:
    print(crc(eachFile))

I don't really know what the excludeLine="" and includeLine="" parameters are for...

user203547
    I know this is ancient, but I'll explain anyway. I gave you a downvote because I don't think it's useful to post code that you don't understand. – datashaman Feb 23 '16 at 06:02