I am new to programming and trying my hand at training an AI model with the MNIST database of handwritten digits. I already have working code, but now I want to delve more into the details.

The first thing I have to do in this project is to read the .gz files, in which integers are stored in MSB-first (big-endian) format. I have done this successfully with the following code:

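import gzip
import shutil
import urllib.request

# `file` holds the name of one MNIST file (without the .gz suffix)
urllib.request.urlretrieve("http://yann.lecun.com/exdb/mnist/%s.gz" % file, "%s.gz" % file)

with gzip.open("%s.gz" % file, "rb") as f_in:
    with open("%s" % file, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)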

I checked the description of what urllib.request.urlretrieve() does, and it says "Retrieve a URL into a temporary location on disk".

I want to understand if it's possible to do this same task without creating a local copy. Is it possible to read through an online .gz file in a different way without urlretrieve?

This is not a problem. I'm just curious and want to understand it better.

martineau
Gvantsa

Note that (at least) on Linux, a temporary file can mean a RAM disk, and the file might not fit in RAM (so it will swap anyway). – user202729 Oct 01 '20 at 15:47

1 Answer

Processing files without downloading the whole file before beginning is called "streaming". It is possible to stream a gzip-compressed file, as the decoding algorithm works by reading through the file sequentially.
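
To illustrate why sequential decoding makes streaming possible, here is a minimal sketch that decompresses gzip data chunk by chunk with zlib.decompressobj. The helper name iter_gunzip and the tiny chunk size are my own choices for illustration, and an in-memory buffer stands in for a real download.

```python
import gzip
import io
import zlib


def iter_gunzip(fileobj, chunk_size=16):
    """Yield decompressed pieces while reading fileobj sequentially."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        piece = decoder.decompress(chunk)
        if piece:
            yield piece
    # Emit anything the decoder is still holding back.
    tail = decoder.flush()
    if tail:
        yield tail


# Assumption: an in-memory buffer plays the role of the remote file.
original = b"streaming example " * 100
compressed = io.BytesIO(gzip.compress(original))
restored = b"".join(iter_gunzip(compressed))
```

Each decompressed piece becomes available as soon as enough compressed bytes have arrived, so the complete compressed file never has to be present at once.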

You can use urllib.request.urlopen to create a streamed file object, which you pass to gzip.GzipFile through its fileobj argument instead of calling gzip.open() on a local file, like:

import gzip
import shutil
from urllib.request import urlopen

# `file` is the MNIST file name, as in the question
with urlopen(f"http://yann.lecun.com/exdb/mnist/{file}.gz") as streamed_file:
    with gzip.GzipFile(fileobj=streamed_file) as f_in:
        with open(f"{file}", "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)

Note that I'm using f-strings (available since Python 3.6) for string formatting.

I haven't tested this code, but the idea should work because all of these functions operate on "file-like objects", which basically just means objects that implement the interfaces described in the io module (such as io.RawIOBase).
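
If the goal is to avoid any local copy at all, the decompressed bytes can also be parsed straight from the stream instead of being written to a file. The following is a rough sketch, assuming the IDX layout used by the MNIST files (two zero bytes, a type code of 0x08 for unsigned bytes, a dimension count, then one 4-byte MSB-first integer per dimension). The function name read_idx is hypothetical, and a small fake file built in memory stands in for the network response.

```python
import gzip
import io
import struct


def read_idx(fileobj):
    """Parse a gzip-compressed IDX stream of unsigned bytes in memory."""
    with gzip.GzipFile(fileobj=fileobj) as f:
        # ">" selects big-endian (MSB first) byte order.
        zeros, dtype_code, ndim = struct.unpack(">HBB", f.read(4))
        if zeros != 0 or dtype_code != 0x08:
            raise ValueError("not an unsigned-byte IDX file")
        # One 4-byte unsigned integer per dimension.
        dims = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        data = f.read()
    return dims, data


# Assumption: a tiny fake label file (3 labels) replaces the real download.
fake = struct.pack(">HBBI", 0, 0x08, 1, 3) + bytes([5, 0, 4])
dims, data = read_idx(io.BytesIO(gzip.compress(fake)))
```

With a real download, the object returned by urlopen(...) could be passed to read_idx in place of the io.BytesIO buffer, since both provide a read() method.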

Multihunter