17

I'm trying to write a small script that prints the checksum of a file (using some code from https://gist.github.com/Zireael-N/ed36997fd1a967d78cb2):

import sys
import os
import hashlib

file = '/Users/Me/Downloads/2017-11-29-raspbian-stretch.img'

with open(file, 'rb') as f:
    contents = f.read()
    print('SHA256 of file is %s' % hashlib.sha256(contents).hexdigest())

But I'm getting the following error message:

Traceback (most recent call last):
  File "checksum.py", line 8, in <module>
    contents = f.read()
OSError: [Errno 22] Invalid argument

What am I doing wrong? I'm using Python 3 on macOS High Sierra.

ShadowRanger
Hallvard
  • Cannot reproduce. Does this happen when trying to take the checksum of any file or only a specific file? Are you using Python 2 or Python 3? Why does your error message say `contents = f.read()` is line 8 when it's only line 6 of the code given here? – jwodder Jan 05 '18 at 23:48
  • Have you tried other files? Python simply translates the error code it got from the operating system (`EINVAL`), and there is a chance that the error code comes from the filesystem driver itself (so it could be a bug there). Normally `EINVAL` in response to a read means that the fd number is wrong, but that is an unusual situation, as Python itself checks that the fd number is valid. – myaut Jan 05 '18 at 23:48
  • Possible duplicate of [OSError \[Errno 22\] invalid argument when use open() in Python](https://stackoverflow.com/questions/25584124/oserror-errno-22-invalid-argument-when-use-open-in-python) – roganjosh Jan 05 '18 at 23:48
  • @roganjosh: The answers given there only apply to Windows systems. This question appears to be about a problem on macOS. – jwodder Jan 05 '18 at 23:51
  • I tried other files now, and they worked fine. The original .img file still gives the same error message though. Could it be because of its size of 4.92 GB? – Hallvard Jan 06 '18 at 00:48
  • @Hallvard: What version of Python are you on? And are you on a 32-bit system? There are a couple of issues that could arise depending on the answers to those two questions. – ShadowRanger Jan 06 '18 at 01:00
  • @roganjosh: It's not a duplicate of that one. That one is an error on `open` (bad file path), for this question the `open` succeeded, but the `read` failed. The causes of one are unlikely to relate to the other. – ShadowRanger Jan 06 '18 at 01:31
  • Why not just use `sha256sum`? – Charles D Pantoga Jan 06 '18 at 01:32
  • Regardless, it's silly to read the whole multi-GB file into memory at once -- there's no need to do that; you can hash it a piece at a time. – Charles Duffy Jan 06 '18 at 01:32
  • I'm using Python version 3.6.2, and the system is 64 bit (MacBook Air). – Hallvard Jan 06 '18 at 01:36
  • [This question](https://stackoverflow.com/q/46458537/364696) appears to be an exact duplicate, but it's also unanswered. I'm guessing some bug in how Python is invoking a Mac system call that works correctly on all other POSIX systems, but I lack a Mac to test, and searches on the Python bug tracker are coming up empty. – ShadowRanger Jan 06 '18 at 01:37

2 Answers

16

Python has had several bugs over its history (most fixed in recent versions) when reading more than 2-4 GB at once from a file handle. (An unfixable variant of the problem occurs on 32-bit builds of Python, which simply lack the virtual address space to allocate the buffer; that one isn't I/O related, but it shows up most often when slurping large files.) A workaround for hashing is to update the hash in fixed-size chunks, which is a good idea anyway, since counting on RAM being larger than the file is risky. The most straightforward approach is to change your code to:

with open(file, 'rb') as f:
    hasher = hashlib.sha256()  # Make empty hasher to update piecemeal
    while True:
        block = f.read(64 * (1 << 20)) # Read 64 MB at a time; big, but not memory busting
        if not block:  # Reached EOF
            break
        hasher.update(block)  # Update with new block
print('SHA256 of file is %s' % hasher.hexdigest())  # Finalize to compute digest

If you're feeling fancy, you can "simplify" the loop using two-arg `iter` and some `functools` magic (this needs `import functools` at the top of the script), replacing the whole of the `while` loop with:

for block in iter(functools.partial(f.read, 64 * (1 << 20)), b''):
    hasher.update(block)

Or, on Python 3.8+, the walrus operator (`:=`) makes it simpler, with no need for imports or unreadable code:

while block := f.read(64 * (1 << 20)):  # Assigns and tests result in conditional!
    hasher.update(block)
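
For reuse, the chunked approach above could be wrapped in a small helper. This is only a sketch: the function name `sha256_of_file` and the `chunk_size` parameter are illustrative choices, not part of `hashlib`, and the default of 64 MB simply mirrors the block size used above.

```python
import hashlib

CHUNK_SIZE = 64 * (1 << 20)  # 64 MB, the same block size as in the loop above


def sha256_of_file(path, chunk_size=CHUNK_SIZE):
    """Return the hex SHA-256 digest of the file at `path`.

    Hypothetical helper: reads the file in `chunk_size` pieces so that
    no single read() call asks for more than a few MB at once.
    """
    hasher = hashlib.sha256()
    with open(path, 'rb') as f:
        while True:
            block = f.read(chunk_size)
            if not block:  # Empty bytes object means EOF
                break
            hasher.update(block)
    return hasher.hexdigest()
```

It would be called as `print('SHA256 of file is %s' % sha256_of_file(file))`. On Python 3.11+, `hashlib.file_digest(f, 'sha256')` on a file opened in binary mode should give the same result without a hand-written loop.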
ShadowRanger
-3

Wow, this can be much simpler. Just read the file line by line:

with open('big-file.txt') as f:
  for i in f:
    print(i)
duhaime