Read numpy data from GZip file over the network

Question

I am attempting to download the MNIST dataset and decode it without writing it to disk (mostly for fun).

import struct
from gzip import GzipFile
from urllib.request import urlopen

import numpy as np

def load_images():
    request_stream = urlopen('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz')
    zip_file = GzipFile(fileobj=request_stream, mode='rb')
    with zip_file as fd:
        magic, numberOfItems = struct.unpack('>ii', fd.read(8))
        rows, cols = struct.unpack('>II', fd.read(8))
        images = np.fromfile(fd, dtype='uint8') # < here be dragons
        images = images.reshape((numberOfItems, rows, cols))
        return images

This code fails with OSError: obtaining file position failed, an error that seems to be ungoogleable. What could the problem be?

oarfish · Accepted Answer · 2017-11-06T18:36:09.483

The problem seems to be that the objects gzip and similar modules provide are not real files (unsurprisingly), whereas np.fromfile tries to read through the underlying C-level FILE* pointer, so this cannot work.

If it is acceptable to read the entire file into memory (which it might not be), you can work around this by reading everything after the header into a bytearray and deserializing from that:

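rows, cols = struct.unpack('>II', fd.read(8))
# Read the rest of the decompressed stream into memory
b = bytearray(fd.read())
# Interpret the buffer as unsigned bytes without copying it
images = np.frombuffer(b, dtype='uint8')
images = images.reshape((numberOfItems, rows, cols))
return images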
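For completeness, here is a minimal self-contained sketch of the same workaround. The function name read_idx_images is made up for illustration, and the synthetic in-memory payload is only a stand-in for the real download so the snippet runs without network access; it assumes the 16-byte header layout used in the question (magic number, item count, rows, cols).

```python
import gzip
import io
import struct

import numpy as np


def read_idx_images(compressed_stream):
    """Decode gzip-compressed IDX3 image data from a binary file-like object.

    Hypothetical helper; assumes a 16-byte big-endian header
    (magic, item count, rows, cols) followed by uint8 pixels.
    """
    with gzip.GzipFile(fileobj=compressed_stream, mode='rb') as fd:
        magic, number_of_items, rows, cols = struct.unpack('>iiII', fd.read(16))
        # Pull the remaining decompressed bytes into memory instead of
        # handing the gzip wrapper to np.fromfile.
        pixels = bytearray(fd.read())
    images = np.frombuffer(pixels, dtype=np.uint8)
    return images.reshape((number_of_items, rows, cols))


# Synthetic stand-in for the downloaded file: 2 images of 3x4 pixels.
header = struct.pack('>iiII', 2051, 2, 3, 4)
payload = gzip.compress(header + bytes(range(24)))

images = read_idx_images(io.BytesIO(payload))
print(images.shape)
```

With a real download, the response object returned by urlopen could presumably be passed in place of the io.BytesIO wrapper, since GzipFile only needs an object with a read() method for decompression.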
