
I would like to compute the hash of the contents (sequence of bits) of a file (whose length could be any number of bits, and so not necessarily a multiple of the trendy eight) and send that file to a friend along with the hash-value. My friend should be able to compute the same hash from the file contents. I want to use Python 3 to compute the hash, but my friend can't use Python 3 (because I'll wait till next year to send the file and by then Python 3 will be out of style, and he'll want to be using Python++ or whatever). All I can guarantee is that my friend will know how to compute the hash, in a mathematical sense---he might have to write his own code to run on his implementation of the MIX machine (which he will know how to do).

What hash do I use, and, more importantly, what do I take the hash of? For example, do I hash the str returned from a read on the file opened for reading as text? Do I hash some bytes-like object returned from a binary read? What if the file has weird end-of-line markers? Do I pad the tail end first so that the thing I am hashing is an appropriate size?

import hashlib
FILENAME = "filename"
# Now, what?

I say "sequence of bits" because not all computers are based on the 8-bit byte, and saying "sequence of bytes" is therefore too ambiguous. For example, GreenArrays, Inc. has designed a supercomputer on a chip, where each computer has 18-bit (eighteen-bit) words (when these words are used for encoding native instructions, they are composed of three 5-bit "bytes" and one 3-bit byte each). I also understand that before the 1970s, a variety of byte sizes were used. Although the 8-bit byte may be the most common choice, and may be optimal in some sense, the choice of 8 bits per byte is arbitrary.

See Also

Is python's hash() portable?

Ana Nimbus
  • Are you sure you're looking at a sequence of bits and not bytes? I'm not aware of any filesystems that allow file sizes to be specified in bits (although I'm only familiar with x86- and ARM-based machines). Either way, you definitely don't want to use the `str` cause that's detached from the file encoding. – wjandrea Sep 26 '21 at 23:19
  • Hmm, maybe I'm not understanding. How would Python open a file for reading as a sequence of bits and not bytes? If you did need to send a file to a computer with an exotic architecture, how would you do it? – wjandrea Sep 27 '21 at 16:35

2 Answers


First of all, Python's built-in hash() function is not a cryptographic hash function. Here are the differences:

hash()

A hash is a fixed-size integer that identifies a particular value. Each value needs to have its own hash, so for the same value you will get the same hash even if it's not the same object.

Note that the hash of a value only needs to be the same for one run of Python. Since Python 3.3, the hashes of str and bytes values do in fact change with every new run of Python, because hash randomization is enabled by default.

What does hash do in python?
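To see why the built-in hash() is unsuitable for sharing with someone else, here is a small sketch, assuming CPython and its documented PYTHONHASHSEED environment variable. The helper name `builtin_hash_in_fresh_interpreter` is made up for illustration; it starts a new interpreter with a chosen seed and reports what hash() returns there.

```python
import os
import subprocess
import sys


def builtin_hash_in_fresh_interpreter(text, seed):
    """Return hash(text) as computed by a new Python process.

    Hypothetical helper: `seed` is passed through PYTHONHASHSEED, which
    controls the randomization of str and bytes hashes.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# The same seed reproduces the value; a different seed gives another value.
print(builtin_hash_in_fresh_interpreter("abc", 1))
print(builtin_hash_in_fresh_interpreter("abc", 1))
print(builtin_hash_in_fresh_interpreter("abc", 2))
```

Without a fixed seed, each run picks its own, so a value from hash() on a str is only meaningful inside the process that computed it.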

Cryptographic hash functions

A cryptographic hash function (CHF) is a mathematical algorithm that maps data of an arbitrary size (often called the "message") to a bit array of a fixed size

It is deterministic, meaning that the same message always results in the same hash.

https://en.wikipedia.org/wiki/Cryptographic_hash_function
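As a quick illustration of that determinism, the sketch below hashes the three bytes `abc` with hashlib and compares the result against the well-known SHA-256 test vector for that message. The constant name `ABC_SHA256` is only for this example.

```python
import hashlib

# Published SHA-256 test vector for the 3-byte message b"abc".
ABC_SHA256 = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"

digest = hashlib.sha256(b"abc").hexdigest()
print(digest)
print(digest == ABC_SHA256)  # the same bytes give the same digest everywhere
```

Any correct SHA-256 implementation, in any language and on any machine, should produce this same value for this message.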


Now let's come back to your question:

I would like to compute the hash of the contents (sequence of bits) of a file (whose length could be any number of bits, and so not necessarily a multiple of the trendy eight) and send that file to a friend along with the hash-value. My friend should be able to compute the same hash from the file contents.

What you're looking for is a cryptographic hash function. MD5, SHA-1, and SHA-256 are the ones typically used for file hashes; of these, prefer SHA-256, since MD5 and SHA-1 are no longer considered collision-resistant. Open the file in binary mode, feed its bytes to the hash, and encode the resulting digest in hexadecimal form.

import hashlib

def calculateSHA256Hash(filePath):
    h = hashlib.sha256()
    with open(filePath, "rb") as f:
        data = f.read(2048)
        while data != b"":
            h.update(data)
            data = f.read(2048)
    return h.hexdigest()

print(calculateSHA256Hash(filePath = 'stackoverflow_hash.py'))

The above code takes itself as input, so it prints its own SHA-256 hash, 610e15155439c75f6b63cd084c6a235b42bb6a54950dcb8f2edab45d0280335e. This value stays the same as long as the code is not changed.

Another example is to hash a text file, test.txt, with the content Helloworld.

This is done by simply changing the file path in the last line of the code to "test.txt":

print(calculateSHA256Hash(filePath = 'test.txt'))

This gives a SHA-256 hash of 5ab92ff2e9e8e609398a36733c057e4903ac6643c646fbd9ab12d0f6234c8daf.
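On the receiving side, your friend would recompute the digest and compare it with the one you sent. A minimal sketch of that check, assuming the digest travels as a hexadecimal string; the names `file_sha256_hex` and `file_matches_digest` and the 64 KiB chunk size are choices made for this example, not part of the code above.

```python
import hashlib
import hmac


def file_sha256_hex(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the bytes stored in the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # read() returns b"" at end of file, which stops the iteration.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_matches_digest(path, expected_hex):
    """Return True if the file's SHA-256 digest equals expected_hex."""
    received = expected_hex.strip().lower()
    return hmac.compare_digest(file_sha256_hex(path), received)
```

The chunk size only affects memory use, not the digest, so it does not need to be agreed on in advance.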

wjandrea
Lincoln Yan
  • [Binary I/O](https://docs.python.org/3/library/io.html#binary-i-o) uses bytes, not bits. I'm not sure what that means for this answer, but it's important to know. – wjandrea Sep 26 '21 at 23:49
  • @wjandrea , RE: "bytes, not bits"---I have edited the question to add a paragraph elaborating on this. – Ana Nimbus Sep 27 '21 at 13:23
  • There is a [magic number](https://en.wikipedia.org/wiki/Magic_number_(programming)) of `2048` in `calculateSHA256Hash`. Please add a comment to explain. – Ana Nimbus Sep 27 '21 at 13:30
  • `2048` represents the size of the chunk that's being read. This is to allow the program to load large files without using lots of memory. You can kindly Google `f.read()` or similar words for more information related to this topic, for example, the official [Python Docs](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files). – Lincoln Yan Sep 27 '21 at 13:57

I arrived at sha256hexdigestFromFile, an alternative to @Lincoln Yan's calculateSHA256Hash, after reviewing the standard for SHA-256.

This is also a response to my comment about 2048.

def sha256hexdigestFromFile(filePath, blocks = 1):
    '''Return as a str the SHA-256 message digest of contents of
    file at filePath.
        Reference: Introduction of NIST (2015) Secure Hash
    Standard (SHS), FIPS PUB 180-4.  DOI:10.6028/NIST.FIPS.180-4
    '''
    assert isinstance(blocks, int) and 0 < blocks, \
            'The blocks argument must be an int greater than zero.'
    with open(filePath, 'rb') as MessageStream:
        from hashlib import sha256
        from functools import reduce
        def hashUpdated(Hash, MESSAGE_BLOCK):
            Hash.update(MESSAGE_BLOCK)
            return Hash
        def messageBlocks():
            'Return a generator over the blocks of the MessageStream.'
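            # 4 * 512 = 2048 bytes per read, a multiple of SHA-256's
            # 512-bit (64-byte) block; the read size does not affect the digest.
            WORD_SIZE, BLOCK_SIZE = 4, 512
            BYTE_COUNT = WORD_SIZE * BLOCK_SIZE * blocks
            # Yield chunks until read() returns b'' at end of file.
            yield from iter(lambda: MessageStream.read(BYTE_COUNT), b'')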
        return reduce(hashUpdated, messageBlocks(), sha256()).hexdigest()
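hashlib only accepts whole bytes, but FIPS 180-4 defines SHA-256 over a message of any number of bits: append a single 1 bit, then zero bits, then the original length as a 64-bit integer. For a file whose length is not a multiple of eight bits, one option is a rough pure-Python sketch like the following. The function names (`sha256_of_bits`, `bits_from_bytes`, and the underscore helpers) are hypothetical, the bits are assumed to arrive most significant first, and the code is written for clarity rather than speed.

```python
import hashlib
from math import isqrt

MASK = 0xFFFFFFFF  # keep all arithmetic within 32-bit words


def _primes(count):
    """Return the first `count` primes by trial division."""
    found, n = [], 2
    while len(found) < count:
        if all(n % p for p in found):
            found.append(n)
        n += 1
    return found


def _icbrt(n):
    """Integer cube root: the largest r with r**3 <= n."""
    r = round(n ** (1 / 3))
    while r ** 3 > n:
        r -= 1
    while (r + 1) ** 3 <= n:
        r += 1
    return r


# FIPS 180-4 constants: the first 32 fractional bits of the square roots of
# the first 8 primes (H0) and of the cube roots of the first 64 primes (K).
H0 = [isqrt(p << 64) & MASK for p in _primes(8)]
K = [_icbrt(p << 96) & MASK for p in _primes(64)]


def _rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def bits_from_bytes(data):
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def sha256_of_bits(bits):
    """Return the SHA-256 hex digest of a sequence of 0/1 values."""
    bits = [1 if bit else 0 for bit in bits]
    length = len(bits)
    # Padding (FIPS 180-4, section 5.1.1): a 1 bit, zeros, the 64-bit length.
    bits.append(1)
    bits.extend([0] * ((448 - len(bits)) % 512))
    bits.extend((length >> shift) & 1 for shift in range(63, -1, -1))
    words = [int("".join(map(str, bits[i:i + 32])), 2)
             for i in range(0, len(bits), 32)]
    state = list(H0)
    for start in range(0, len(words), 16):  # one 512-bit block per pass
        w = words[start:start + 16]
        for t in range(16, 64):  # message schedule
            s0 = _rotr(w[t - 15], 7) ^ _rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
            s1 = _rotr(w[t - 2], 17) ^ _rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
            w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
        a, b, c, d, e, f, g, h = state
        for t in range(64):  # compression rounds
            t1 = (h + (_rotr(e, 6) ^ _rotr(e, 11) ^ _rotr(e, 25))
                  + ((e & f) ^ (~e & g)) + K[t] + w[t]) & MASK
            t2 = ((_rotr(a, 2) ^ _rotr(a, 13) ^ _rotr(a, 22))
                  + ((a & b) ^ (a & c) ^ (b & c))) & MASK
            a, b, c, d, e, f, g, h = ((t1 + t2) & MASK, a, b, c,
                                      (d + t1) & MASK, e, f, g)
        state = [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]
    return "".join(f"{x:08x}" for x in state)


# For whole bytes the sketch should agree with hashlib.
print(sha256_of_bits(bits_from_bytes(b"abc")) == hashlib.sha256(b"abc").hexdigest())
# A 5-bit message, which hashlib cannot express directly.
print(sha256_of_bits([1, 0, 1, 1, 0]))
```

When the data does happen to be a whole number of 8-bit bytes, this reduces to what hashlib computes, so the byte-oriented code above remains the practical choice for ordinary files.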
Ana Nimbus