
In this question "Generating an MD5 checksum of a file", I had this code:

import hashlib
def hashfile(afile, hasher, blocksize=65536):
    buf = afile.read(blocksize)
    while len(buf) > 0:
        hasher.update(buf)
        buf = afile.read(blocksize)
    return hasher.digest()

[(fname, hashfile(open(fname, 'rb'), hashlib.sha256())) for fname in fnamelst]

I was criticized for opening a file inside a list comprehension, and one person opined that with a long enough list I would run out of open file handles. Interfaces were suggested that had hashfile take a filename argument and open the file itself with `with`, at the cost of significantly reducing hashfile's flexibility.
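
The suggested interface would have looked roughly like this (my reconstruction; the exact code suggested varied):

def hashfile(fname, hasher, blocksize=65536):
    # Takes a filename instead of a file object, so it can no longer
    # hash sockets, pipes, or in-memory streams.
    with open(fname, 'rb') as afile:
        buf = afile.read(blocksize)
        while len(buf) > 0:
            hasher.update(buf)
            buf = afile.read(blocksize)
        return hasher.digest()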

Were these necessary? Was I really doing something that wrong?

Testing out this code:

#!/usr/bin/python3

import sys
from pprint import pprint # Pretty printing

class HereAndGone(object):
    def __init__(self, i):
        print("%d %x -> coming into existence." % (i, id(self)),
              file=sys.stderr)
        self.i_ = i
    def __del__(self):
        print("%d %x <- going away now." % (self.i_, id(self)),
              file=sys.stderr)

def do_nothing(hag):
    return id(hag)

l = [(i, do_nothing(HereAndGone(i))) for i in range(0, 10)]

pprint(l)

results in this output:

0 7f0346decef0 -> coming into existence.
0 7f0346decef0 <- going away now.
1 7f0346decef0 -> coming into existence.
1 7f0346decef0 <- going away now.
2 7f0346decef0 -> coming into existence.
2 7f0346decef0 <- going away now.
3 7f0346decef0 -> coming into existence.
3 7f0346decef0 <- going away now.
4 7f0346decef0 -> coming into existence.
4 7f0346decef0 <- going away now.
5 7f0346decef0 -> coming into existence.
5 7f0346decef0 <- going away now.
6 7f0346decef0 -> coming into existence.
6 7f0346decef0 <- going away now.
7 7f0346decef0 -> coming into existence.
7 7f0346decef0 <- going away now.
8 7f0346decef0 -> coming into existence.
8 7f0346decef0 <- going away now.
9 7f0346decef0 -> coming into existence.
9 7f0346decef0 <- going away now.
[(0, 139652050636528),
 (1, 139652050636528),
 (2, 139652050636528),
 (3, 139652050636528),
 (4, 139652050636528),
 (5, 139652050636528),
 (6, 139652050636528),
 (7, 139652050636528),
 (8, 139652050636528),
 (9, 139652050636528)]

It's obvious that each HereAndGone object is created and destroyed as the corresponding element of the list is computed. CPython's reference counting frees each object as soon as the last reference to it goes away, which happens immediately after the value for that list element has been computed.

Of course, maybe other Python implementations don't do this. Is a Python implementation required to do some form of reference counting? From the documentation of the gc module, it certainly seems like reference counting is a core feature of the language.

And if I did do something wrong, how would you suggest I rewrite it to retain the succinct clarity of a list comprehension and the flexibility of an interface that works with anything that can be read like a file?

Omnifarious
  • "Is it required for a Python implementation to do some form of reference counting?" - no. – user2357112 Dec 29 '16 at 08:26
  • 1
    "It certainly seems from the documentation of the gc module like reference counting is a core feature of the language." - most of the gc module, especially the parts about turning it off, should be considered optional features. – user2357112 Dec 29 '16 at 08:28
  • Modify `hashfile` so it takes a file name and deals with opening and closing the file itself. It's a terrible idea to use the system's memory management to manage other resources in general. –  Dec 29 '16 at 08:30
  • @tfb - That seems horribly ugly. Why should a perfectly good general purpose function be turned into a very specific purpose function just to control the scope of a resource? – Omnifarious Dec 29 '16 at 08:31
  • @tfb - And in C++ using scope and object creation and destruction as a way of managing resources other than memory is considered good practice. So, it's not a "terrible idea" in the general way it seems like you're implying. It's just a bad idea in a language that has ill-defined semantics about when objects are destroyed. – Omnifarious Dec 29 '16 at 08:36
  • @Omnifarious: if that function is used elsewhere then obviously write a wrapper (perhaps a local function): if not, then, well, not: it doesn't need to be general if nothing uses it. *Scope* is a fine way of managing resources other than memory, and indeed you'd obviously do that using `with ...` in Python which is as close as Python gets to a scope operator. I tend to take the view that a good approach to any problem is 'look at what C++ does, and don't do that'... –  Dec 29 '16 at 10:06
  • 2
    Quoting from "Data model" section of the Language reference: _Do not depend on immediate finalization of objects when they become unreachable (so you should always close files explicitly)_ – VPfB Dec 29 '16 at 13:07
  • @VPfB - That looks like part of a definite answer there. I'll make it if you don't in an hour or three. – Omnifarious Dec 29 '16 at 16:47
  • @Omnifarious Please go ahead. I cannot just now. – VPfB Dec 29 '16 at 20:27
  • @VPfB - Done. :-) I've now answered my own question. What do you think? – Omnifarious Jan 06 '17 at 19:26
  • The standard library contains various functions with a disclaimer saying that they work only in CPython, e.g. [`inspect.currentframe`](https://docs.python.org/3/library/inspect.html#inspect.currentframe). – Bakuriu Jan 06 '17 at 19:53
  • @Omnifarious I couldn't have written the part about finalization better. – VPfB Jan 07 '17 at 10:31

1 Answer


Someone pointed to the Data Model section of the Python Language Reference, which says quite explicitly: "Do not depend on immediate finalization of objects when they become unreachable (so you should always close files explicitly)." So the code is relying on behavior that's not guaranteed.

And even if it were guaranteed, the code would still be fragile. It invisibly relies on the file never being referenced by a data structure that has a circular reference, or whose lifetime extends beyond the hashing of an individual file. Who knows what might happen to the code in the future, and whether anyone will remember this key detail?
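
For instance, here is a minimal sketch (the Holder class is hypothetical) of how a mere reference cycle defeats immediate, refcount-based cleanup even in CPython:

import io

class Holder(object):
    def __init__(self, stream):
        self.stream = stream
        self.self_ref = self  # an innocent-looking reference cycle

h = Holder(io.BytesIO(b'data'))
del h  # the refcount never reaches zero, so the stream is not
       # finalized until the cycle collector happens to run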

The question is what to do about it. The hashfile function in the question is nicely flexible, and it seems a shame to narrow its interface to take a filename, opening the file inside the function and thereby killing that flexibility. I feel the right solution is to rethink the interface a bit and make it even more general:

def hash_bytestr_iter(hasher, bytesiter, ashexstr=False):
    # Feed every block from the iterator into the hasher.
    for block in bytesiter:
        hasher.update(block)
    return (hasher.hexdigest() if ashexstr else hasher.digest())

def iter_and_close_file(afile, blocksize=65536):
    # Yield the file's contents block by block, closing the file
    # once the last block has been read (or the generator is closed).
    with afile:
        block = afile.read(blocksize)
        while len(block) > 0:
            yield block
            block = afile.read(blocksize)

One could just make the original hashfile use the passed-in afile as a context manager, but I feel that breaks expectations in a subtle way: it makes hashfile close the file, while its name promises only that it will compute a hash.
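
For comparison, that rejected variant would look something like this sketch; the surprising part is the close-on-exit side effect:

def hashfile(afile, hasher, blocksize=65536):
    with afile:  # surprising: hashfile now closes the caller's file
        buf = afile.read(blocksize)
        while len(buf) > 0:
            hasher.update(buf)
            buf = afile.read(blocksize)
        return hasher.digest()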

And I suspect there are a lot of cases where you have blocks of bytes and would like to hash them all as if they were part of one continuous block or stream. Hashing an iterator over blocks of bytes is even more general than hashing a file.

And similarly, I think there are a lot of cases where you want to iterate over a file-like object and then close it. Separating these two concerns makes both functions reusable and general.
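
With those two pieces, the original list comprehension needs only a small change, something like this (assuming fnamelst from the question):

import hashlib

[(fname, hash_bytestr_iter(hashlib.sha256(),
                           iter_and_close_file(open(fname, 'rb'))))
 for fname in fnamelst]

Each file is read to the end and closed inside the generator while its hash is computed, so only one file is ever open at a time.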

Omnifarious