How to correctly parse the MNIST dataset (IDX format) into Python arrays?

Question

I'm new to machine learning, and I wanted to avoid downloading the MNIST dataset through the openml module every time I need to work on it. I found this code online that converts the IDX files into Python arrays, but my train_set labels keep coming up 8 values short. I believe it has to do with the way I converted them.

import numpy as np
import struct

with open('train-images.idx3-ubyte', 'rb') as f:
    magic, size = struct.unpack('>II', f.read(8))
    nrows, ncols = struct.unpack('>II', f.read(8))
    data = np.fromfile(f, dtype=np.dtype(np.uint8)).newbyteorder(">")
    data = data.reshape((size,nrows,ncols))

with open('train-labels.idx1-ubyte', 'rb') as i:
    magic, size = struct.unpack('>II', i.read(8))
    nrows, ncols = struct.unpack('>II', i.read(8))
    data_1 = np.fromfile(i, dtype=np.dtype(np.uint8)).newbyteorder(">")    
    
x_train, y_train = data, data_1
len(x_train), len(y_train)

>>> (60000,59992)

As shown in the output above, this leaves my labels misaligned, so not all training images are linked to the correct label. I have downloaded the file several times to make sure it isn't corrupted. Please, I need help. Thanks.

This worked fine for me, but as I noted in the question, I don't want to download the dataset each time I resume work. PS: I am using Jupyter Notebook. — Brian Obot, Jul 17 '20 at 16:41

score 1 · Accepted Answer · answered Jul 17 '20 at 17:12

Check the documentation

TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (MSB first)
0004     32 bit integer  60000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label
The labels values are 0 to 9.

The first 4 bytes are the magic number and the next 4 are the number of items; the labels start right after that. So you have to skip 8 bytes to reach the labels, but you are skipping 16, which jumps over the first 8 labels.
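The arithmetic can be checked without the real file. The following is a minimal sketch that builds a small in-memory label file (a hypothetical 20-label stand-in, not the actual MNIST data; `read_labels` is an illustrative helper name) and compares skipping an 8-byte header against skipping 16 bytes:

```python
import io
import struct

import numpy as np

# Synthetic IDX1 label file: magic 2049, item count, then one byte per label.
n_items = 20  # hypothetical size, standing in for the real 60000
labels = np.arange(n_items, dtype=np.uint8) % 10
raw = struct.pack('>II', 2049, n_items) + labels.tobytes()


def read_labels(buf, header_bytes):
    """Skip `header_bytes` bytes, then interpret the rest as uint8 labels."""
    f = io.BytesIO(buf)
    f.read(header_bytes)
    return np.frombuffer(f.read(), dtype=np.uint8)


correct = read_labels(raw, 8)    # 8-byte header, as documented for label files
too_far = read_labels(raw, 16)   # 16 bytes, as if it were the image header

print(len(correct), len(too_far))  # prints: 20 12
```

The second read comes up 8 labels short and starts at the ninth label, so every remaining label is shifted against its image. The same offset accounts for 59992 instead of 60000.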

Fix

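with open('train-labels.idx1-ubyte', 'rb') as i:
    # The label header is only 8 bytes: magic number, then item count.
    magic, size = struct.unpack('>II', i.read(8))
    # Labels are single bytes, so no byte-order conversion is needed.
    data_1 = np.fromfile(i, dtype=np.uint8)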
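If one loader for both files is preferred, the header length can be derived from the magic number instead of being hard-coded. In the IDX layout the third byte of the magic number is the data type (0x08 for unsigned byte) and the fourth byte is the number of dimensions, which is why the label file has 2049 (0x0801) and the image file 2051 (0x0803). A minimal sketch, assuming unsigned-byte data only; `read_idx` is a hypothetical helper name, not part of NumPy or of the original answer:

```python
import struct

import numpy as np


def read_idx(f):
    """Read an unsigned-byte IDX array from a binary file object.

    Hypothetical helper: it handles only the uint8 type code (0x08)
    used by the MNIST files.
    """
    # Magic number: two zero bytes, a type code, and the dimension count.
    zeros, dtype_code, ndim = struct.unpack('>HBB', f.read(4))
    if zeros != 0 or dtype_code != 0x08:
        raise ValueError('not an unsigned-byte IDX file')
    # One big-endian 32-bit size per dimension follows the magic number.
    shape = struct.unpack('>' + 'I' * ndim, f.read(4 * ndim))
    # frombuffer gives a read-only view; call .copy() if you need to modify it.
    data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)
```

Called on each of the two files from the question (opened in 'rb' mode), this should return arrays of shape (60000, 28, 28) and (60000,), since the label header contributes only one dimension.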

I was looking for an answer to my question (https://stackoverflow.com/questions/65156592/convert-mnist-data-from-numpy-arrays-to-original-ubyte-data) and I came across this, can this method be adapted to address my question? If possible, could you show me how? — Slowat_Kela, Dec 05 '20 at 13:15
