The variable extract_path points to an MNIST training file, and I use the gzip module to extract the data from it. What puzzles me is that the variable magic is expected to be 2051. What does 2051 mean?

My second question is about the variable bytestream: it is read four times before the image data is loaded, and I don't understand what those reads do.


    import gzip

    import numpy as np


    def _read32(bytestream):
        dt = np.dtype(np.uint32).newbyteorder('>')
        return np.frombuffer(bytestream.read(4), dtype=dt)[0]


    # extract_path is the path to the gzipped MNIST training image file
    with open(extract_path, 'rb') as f:
        with gzip.GzipFile(fileobj=f) as bytestream:
            magic = _read32(bytestream)
            if magic != 2051:
                raise ValueError('Invalid magic number {} in file: {}'.format(magic, f.name))
            num_images = _read32(bytestream)
            rows = _read32(bytestream)
            cols = _read32(bytestream)
            buf = bytestream.read(rows * cols * num_images)
            data = np.frombuffer(buf, dtype=np.uint8)
            data = data.reshape(num_images, rows, cols)

Any help is appreciated.
Charles Duffy
Johnny
    BTW, see the *very* closely-related question (perhaps from another student working on the same problem?) at [The size parameter for gzip.open().read()](https://stackoverflow.com/questions/54019755/the-size-parameter-for-gzip-open-read/54024615#54024615) – Charles Duffy Jan 09 '19 at 13:55

1 Answer

This has nothing to do with gzip or Python. It's part of the file format specification for training set image files in the MNIST database.

From http://yann.lecun.com/exdb/mnist/:

TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[offset] [type]          [value]          [description] 
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images 
0008     32 bit integer  28               number of rows 
0012     32 bit integer  28               number of columns

Thus, the value 2051 identifies the file as an MNIST image file and distinguishes it from other file types (such as label files, which use the magic number 2049).
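To make the value concrete, here is a minimal sketch of how the first four bytes of such a file turn into 2051. The names IMAGE_MAGIC, LABEL_MAGIC and read_magic are made up for illustration, and the header bytes are typed in by hand rather than read from a real file.

```python
import struct

IMAGE_MAGIC = 2051  # 0x00000803, the image file magic number from the table above
LABEL_MAGIC = 2049  # 0x00000801, the label file magic number


def read_magic(header_bytes):
    """Interpret the first four bytes as a big-endian unsigned 32-bit integer."""
    (magic,) = struct.unpack('>I', header_bytes[:4])
    return magic


# Hand-written stand-in for the first four bytes of an image file
image_header = bytes([0x00, 0x00, 0x08, 0x03])
magic = read_magic(image_header)
print(magic, hex(magic))

# Reading the same bytes as little-endian gives a different number,
# which is why the question's code forces big-endian with newbyteorder('>')
(wrong_order,) = struct.unpack('<I', image_header)
print(wrong_order == IMAGE_MAGIC)
```

The byte order matters here: the format stores its integers most significant byte first, so a little-endian read of the same four bytes would fail the magic check.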

Similarly, three more 4-byte (32-bit) values follow the magic number, giving the number of images, the number of rows, and the number of columns. The subsequent _read32() calls consume that data and store the values in the variables num_images, rows and cols respectively.
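As a rough illustration of those four reads, the sketch below parses the whole 16-byte header in one call with struct instead of four separate _read32() calls. It is an alternative to the code in the question, not a drop-in replacement; parse_idx3_header is a hypothetical helper, and the input is a tiny synthetic file built in memory (2 images of 3x4 pixels), not real MNIST data. It uses only the standard library.

```python
import gzip
import io
import struct


def parse_idx3_header(bytestream):
    """Read the 16-byte header: magic, image count, rows, cols (big-endian uint32)."""
    magic, num_images, rows, cols = struct.unpack('>IIII', bytestream.read(16))
    if magic != 2051:
        raise ValueError('Invalid magic number {}'.format(magic))
    return num_images, rows, cols


# Synthetic stand-in for a gzipped image file: header followed by one byte per pixel
pixels = bytes(range(2 * 3 * 4))
raw = struct.pack('>IIII', 2051, 2, 3, 4) + pixels
compressed = gzip.compress(raw)

with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as bytestream:
    num_images, rows, cols = parse_idx3_header(bytestream)
    # Everything after the header is pixel data, rows * cols bytes per image
    buf = bytestream.read(num_images * rows * cols)

image_size = rows * cols
images = [buf[i * image_size:(i + 1) * image_size] for i in range(num_images)]
print(num_images, rows, cols, len(images[0]))
```

Each read advances the stream position, so after the header has been consumed the next read starts exactly at the first pixel of the first image.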

This use of "magic number" matches the term's general meaning in the context of file formats: a constant at a known offset that identifies the file type. These are the constants that libmagic (the library the file utility uses to guess file types) looks for. Better practice for newly developed formats is to use proper UUIDs rather than short integers, since short integers are much more likely to occur by chance.

Charles Duffy