8

I need a Python/C/C++/Java implementation that can pause a hash computation and store its progress in a file, so that the computation can be resumed from that file later.

Whichever of the listed languages it is written in, it must be usable from Python. Ideally it should work with "hashlib", but that is not required. If such a thing already exists, a link to it is sufficient.

To give an idea of what the implementation should achieve:

import hashlib
import hashpersist #THIS IS NEEDED.

sha256 = hashlib.sha256("Hello ")
hashpersist.save_state(sha256, open('test_file', 'w'))

sha256_recovered = hashpersist.load_state(open('test_file', 'r'))
sha256_recovered.update("World")
print sha256_recovered.hexdigest()

This should give the same output as hashing "Hello World" in one go with the standard sha256 function.

a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e
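
For reference, the target behaviour can be checked with plain hashlib: feeding the data in pieces gives the same digest as hashing it in one call. This is only a sketch of the expected result, written for Python 3 (so byte strings are used); it does not persist anything.

```python
import hashlib

# Hash the whole message in one call.
one_shot = hashlib.sha256(b"Hello World").hexdigest()

# Hash the same message in two pieces, as a resumed hash would.
incremental = hashlib.sha256(b"Hello ")
incremental.update(b"World")

print(incremental.hexdigest() == one_shot)  # True
print(one_shot)
```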
Devesh Saini
  • 2
    possible duplicate of [Persisting hashlib state](http://stackoverflow.com/questions/2130892/persisting-hashlib-state) – PM 2Ring Nov 10 '14 at 08:14
  • As you've no doubt discovered you can't pickle hashlib's HASH objects; see [Persisting hashlib state](http://stackoverflow.com/q/2130892/4014959) for an explanation and some options. But you can speed up your hashing by using a larger blocksize, eg 64kB. – PM 2Ring Nov 10 '14 at 08:17
  • @PM2Ring No answer is satisfying in your suggested question because I need something(SOME_CLASS) which can persist any Hash object from hashlib. – Devesh Saini Nov 10 '14 at 08:29
  • 1
    The main answer in the question I linked to explains why you **can't** persist Hash objects from hashlib. I agree that the options mentioned in those answers are unsatisfactory. If you _really_ need hash objects that can persist you'll need to write your own module. Have you tried my suggestion of using a larger block size? In my experiments 64 kilobytes (65536 bytes) works quite well. – PM 2Ring Nov 10 '14 at 19:14
  • That question I linked to has links to pure Python implementations of sha256 and MD5 which can be used to make persistable hash objects, and [this answer](http://stackoverflow.com/a/5866304/4014959) shows how to do it for MD5. But I expect that on standard Python this approach will be *much* slower than hashlib, because most of the work in hashlib is done by the OpenSSL library, which is compiled C/C++ code. – PM 2Ring Nov 11 '14 at 05:54
  • My previous answer to this question has been migrated to [a new question](http://stackoverflow.com/questions/26880953/is-there-any-hash-function-which-have-following-properties/26881016#26881016), as suggested in a (deleted) comment by Devesh Saini. – PM 2Ring Nov 12 '14 at 10:37
  • @PM2Ring What do you want to indicate? – Devesh Saini Nov 12 '14 at 10:58
  • I'm just indicating the new location of my program in case anyone was wondering what happened to it. Why have you put a bounty on this question? – PM 2Ring Nov 12 '14 at 11:08
  • I have put a bounty on it because it hasn't received much attention. Also, if someone who has the answer feels too lazy to write SOME_CLASS for just a few points, this small bounty may persuade them to write the code for me. If I had more reputation, I would have offered a greater bounty, because I seriously need an answer. – Devesh Saini Nov 12 '14 at 11:21
  • 2
    But as has already been explained, **it's simply not possible** to "pause/resume md5 and sha256 hash object from standard "hashlib" module". The state data needed to make it possible is not accessible in any way via the Python hashlib and its hash objects. However, a new, functionally-equivalent, implementation of hashlib _could_ allow such pausing and resuming, since the `struct SHAstate_st` aka `SHA_CTX` _is_ accessible to C programs that use OpenSSL. See `` (on my system that header is in `/usr/include`). – PM 2Ring Nov 12 '14 at 11:44
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/64784/discussion-between-devesh-saini-and-pm-2ring). – Devesh Saini Nov 12 '14 at 11:50
  • hmm... write this code for +50 rep, or keep it to myself so that only MY bitcoin miner speeds up? You've given us a tough choice. – iwolf Nov 14 '14 at 00:10
  • @iwolf : Hmmm. I know nothing about bitcoin. Maybe I shouldn't post my new code. OTOH, it runs at about the same speed as `sha256sum`, so maybe it gives no advantage... but what would _I_ know. :) – PM 2Ring Nov 14 '14 at 04:36
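
The larger-block-size suggestion made in the comments above could look roughly like this with plain hashlib. It is a minimal sketch for Python 3; the function name `sha256_file` is hypothetical, and 64 kB is simply the size reported to work well in the comments, not a tuned value.

```python
import hashlib

BLOCKSIZE = 1 << 16  # 64 kB, the block size suggested in the comments


def sha256_file(path, blocksize=BLOCKSIZE):
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()
```

This speeds up reading but does not make the hash object itself persistable.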

2 Answers

8

It turned out to be easier than I thought to rewrite hashlib to be resumable, at least for the SHA-256 portion. I spent some time playing with C code that uses the OpenSSL crypto library, but then I realised that I don't need all that stuff: I can just use ctypes.

rehash.py

#! /usr/bin/env python

''' A resumable implementation of SHA-256 using ctypes with the OpenSSL crypto library

    Written by PM 2Ring 2014.11.13
'''

import os   # needed for the os.name check when loading the library
from ctypes import *

SHA_LBLOCK = 16
SHA256_DIGEST_LENGTH = 32

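# OpenSSL declares these fields as SHA_LONG, which is a 32-bit unsigned int.
# c_long is 64 bits wide on 64-bit Linux, which would oversize the struct.
class SHA256_CTX(Structure):
    _fields_ = [
        ("h", c_uint * 8),
        ("Nl", c_uint),
        ("Nh", c_uint),
        ("data", c_uint * SHA_LBLOCK),
        ("num", c_uint),
        ("md_len", c_uint)
    ]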

HashBuffType = c_ubyte * SHA256_DIGEST_LENGTH

#crypto = cdll.LoadLibrary("libcrypto.so")
crypto = cdll.LoadLibrary("libeay32.dll" if os.name == "nt" else "libssl.so")

class sha256(object):
    digest_size = SHA256_DIGEST_LENGTH

    def __init__(self, datastr=None):
        self.ctx = SHA256_CTX()
        crypto.SHA256_Init(byref(self.ctx))
        if datastr:
            self.update(datastr)

    def update(self, datastr):
        # The length parameter of SHA256_Update() is a size_t
        crypto.SHA256_Update(byref(self.ctx), datastr, c_size_t(len(datastr)))

    #Clone the current context
    def _copy_ctx(self):
        ctx = SHA256_CTX()
        pointer(ctx)[0] = self.ctx
        return ctx

    def copy(self):
        other = sha256()
        other.ctx = self._copy_ctx()
        return other

    def digest(self):
        #Preserve context in case we get called before hashing is
        # really finished, since SHA256_Final() clears the SHA256_CTX
        ctx = self._copy_ctx()
        hashbuff = HashBuffType()
        crypto.SHA256_Final(hashbuff, byref(self.ctx))
        self.ctx = ctx
        return str(bytearray(hashbuff))

    def hexdigest(self):
        return self.digest().encode('hex')

#Tests
def main():
    import cPickle
    import hashlib

    data = ("Nobody expects ", "the spammish ", "imposition!")

    print "rehash\n"

    shaA = sha256(''.join(data))
    print shaA.hexdigest()
    print repr(shaA.digest())
    print "digest size =", shaA.digest_size
    print

    shaB = sha256()
    shaB.update(data[0])
    print shaB.hexdigest()

    #Test pickling
    sha_pickle = cPickle.dumps(shaB, -1)
    print "Pickle length:", len(sha_pickle)
    shaC = cPickle.loads(sha_pickle)

    shaC.update(data[1])
    print shaC.hexdigest()

    #Test copying. Note that copy can be pickled
    shaD = shaC.copy()

    shaC.update(data[2])
    print shaC.hexdigest()


    #Verify against hashlib.sha256()
    print "\nhashlib\n"

    shaD = hashlib.sha256(''.join(data))
    print shaD.hexdigest()
    print repr(shaD.digest())
    print "digest size =", shaD.digest_size
    print

    shaE = hashlib.sha256(data[0])
    print shaE.hexdigest()

    shaE.update(data[1])
    print shaE.hexdigest()

    #Test copying. Note that hashlib copy can NOT be pickled
    shaF = shaE.copy()
    shaF.update(data[2])
    print shaF.hexdigest()


if __name__ == '__main__':
    main()

resumable_SHA-256.py

#! /usr/bin/env python

''' Resumable SHA-256 hash for large files using the OpenSSL crypto library

    The hashing process may be interrupted by Control-C (SIGINT) or SIGTERM.
    When a signal is received, hashing continues until the end of the
    current chunk, then the current file position, total file size, and
    the sha object is saved to a file. The name of this file is formed by
    appending '.hash' to the name of the file being hashed.

    Just re-run the program to resume hashing. The '.hash' file will be deleted
    once hashing is completed.

    Written by PM 2Ring 2014.11.14
'''

import cPickle as pickle
import os
import signal
import sys

import rehash

quit = False

blocksize = 1<<16   # 64kB
blocksperchunk = 1<<8

chunksize = blocksize * blocksperchunk

def handler(signum, frame):
    global quit
    print "\nGot signal %d, cleaning up." % signum
    quit = True


def do_hash(fname, filesize):
    hashname = fname + '.hash'
    if os.path.exists(hashname):
        with open(hashname, 'rb') as f:
            pos, fsize, sha = pickle.load(f)
        if fsize != filesize:
            print "Error: file size of '%s' doesn't match size recorded in '%s'" % (fname, hashname)
            print "%d != %d. Aborting" % (fsize, filesize)
            sys.exit(1)
    else:
        pos, fsize, sha = 0, filesize, rehash.sha256()

    finished = False
    with open(fname, 'rb') as f:
        f.seek(pos)
        while not (quit or finished):
            for _ in xrange(blocksperchunk):
                block = f.read(blocksize)
                if block == '':
                    finished = True
                    break
                sha.update(block)

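            # Use the real file offset: adding chunksize would overshoot
            # the file size on the final, partial chunk.
            pos = f.tell()
            percent = 100.0 * pos / fsize if fsize else 100.0
            sys.stderr.write(" %6.2f%% of %d\r" % (percent, fsize))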
            if finished or quit:
                break

    if quit:
        with open(hashname, 'wb') as f:
            pickle.dump((pos, fsize, sha), f, -1)
    elif os.path.exists(hashname):
        os.remove(hashname)

    return (not quit), pos, sha.hexdigest()


def main():
    if len(sys.argv) != 2:
        print "Resumable SHA-256 hash of a file."
        print "Usage:\npython %s filename\n" % sys.argv[0]
        sys.exit(1)

    fname = sys.argv[1]
    filesize = os.path.getsize(fname)

    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)

    finished, pos, hexdigest = do_hash(fname, filesize)
    if finished:
        print "%s  %s" % (hexdigest, fname)
    else:
        print "sha-256 hash of '%s' incomplete" % fname
        print "%s" % hexdigest
        print "%d / %d bytes processed." % (pos, filesize)


if __name__ == '__main__':
    main()

demo

import rehash
import pickle
sha=rehash.sha256("Hello ")
s=pickle.dumps(sha.ctx)
sha=rehash.sha256()
sha.ctx=pickle.loads(s)
sha.update("World")
print sha.hexdigest()

output

a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e

edit

I've just made a minor edit to allow rehash to work on Windows, too, although I've only tested it on WinXP. The libeay32.dll can be in the current directory, or somewhere in the system library search path, eg WINDOWS\system32. My rather ancient (and mostly unused) XP installation couldn't find the .dll, even though it's used by OpenOffice and Avira. So I just copied it from the Avira folder to system32. And now it works perfectly. :)

PM 2Ring
  • It gives, OSError: libcrypto.so: cannot open shared object file... Can you write/update instruction for how to get it work? – Devesh Saini Nov 14 '14 at 06:40
  • Maybe you don't have that library file, but you should if you're on Linux and have OpenSSL installed. But try changing the library to "libssl.so", i.e. change `crypto = cdll.LoadLibrary("libcrypto.so")` to `crypto = cdll.LoadLibrary("libssl.so")`. If you're not using Linux or some other form of Unix, you may have to use a slightly different syntax. – PM 2Ring Nov 14 '14 at 06:58
  • I'm not very familiar with Windows, but it looks like the OpenSSL crypto library name on Windows is `libeay32.dll`. And the equivalent to `libssl.so` (the main OpenSSL library) is `ssleay32.dll`. – PM 2Ring Nov 14 '14 at 07:06
  • I tried changing the name to `libssl.so` but no effect. It keeps on raising same exception. I also tried installing openssl by doing `apt-get install openssl` and tried both names i.e. `libcrypto.so` and `libssl.so`. Same problem! – Devesh Saini Nov 14 '14 at 07:16
  • Those libraries should be in `/usr/lib/`, as symbolic links to the actual libraries. I normally use `synaptic` to manage packages. On my system, the SSL libraries are part of the `libssl0.9.8` package; I guess `apt-get install libssl0.9.8` should work, but I don't understand why that wouldn't have been automatically installed as a dependency of the main OpenSSL package. _Maybe_ you also need the developer files in the `libssl-dev` package so that the Python ctypes stuff works. – PM 2Ring Nov 14 '14 at 07:34
  • And maybe you need to fix your sources list. Did you modify it a few months back when Heartbleed was a thing? Those libraries (and some others) should be listed if you do `ldd $(which openssl)`. Also try (as root) `dpkg -l | grep libssl`. – PM 2Ring Nov 14 '14 at 07:56
  • I first installed `libssl0.9.8` and then `libssl-dev`. Now, it is working. Perfect solution :) – Devesh Saini Nov 14 '14 at 10:04
  • 1
    Excellent! And thankyou. :big grin:. With a minor change it also works on Windows, although I've only tested it on WinXP. – PM 2Ring Nov 14 '14 at 11:40
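
For the library-loading trouble discussed in the comments above, one option is to let ctypes search for the library instead of hard-coding a file name. This is a minimal sketch for Python 3; `load_crypto` is a hypothetical helper and not part of rehash.py, and whether `find_library` locates OpenSSL at all depends on the platform and on which packages are installed.

```python
import ctypes
import ctypes.util


def load_crypto():
    """Try to load the OpenSSL crypto library; return None if unavailable."""
    # find_library() takes the name without the "lib" prefix or suffix
    # and returns None when nothing suitable is found.
    path = ctypes.util.find_library("crypto")
    if path is None:
        return None
    try:
        return ctypes.CDLL(path)
    except OSError:
        # The name was resolved but the library could not be loaded.
        return None


crypto = load_crypto()
if crypto is None:
    print("OpenSSL crypto library not found; is the libssl package installed?")
else:
    print("Loaded the OpenSSL crypto library")
```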
1

A pure-Python implementation that supports importing and exporting its state as a Python dict: https://pypi.org/project/sha256bit/

Demo:

>>> from sha256bit import Sha256bit
>>> h1 = Sha256bit("a".encode())
>>> state = h1.export_state()
>>> h2 = Sha256bit.import_state(state)
>>> h2.update("bc".encode())
>>> h2.hexdigest()
'ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad'

state is a regular python dict that can be persisted.

acapola