
I have downloaded the MNIST dataset from LeCun's site. I want to write Python code that extracts the gzip files and reads the dataset directly from a local directory, so that I no longer have to download it or access the MNIST site.

Desired process: access folder/directory --> extract gzip --> read dataset (one-hot encoding)

How can I do this? Almost all tutorials access either the LeCun or TensorFlow site to download and read the dataset. Thanks in advance!

Joshua
  • You should extract the gzip locally onto your computer and then use `scipy.misc.imread` or opencv to read images to Python. – yuji Jan 15 '18 at 05:17
  • Have you tried anything? – Vivek Kumar Jan 15 '18 at 09:32
  • Yes, I tried removing 'from tensorflow.examples.tutorials.mnist import input_data', but it still downloads the dataset from the site. I am still figuring out why it accesses the site and downloads the dataset even when only the line "mnist = input_data.read_data_sets('mnist_data/', one_hot=True)" is left. – Joshua Jan 15 '18 at 13:42

3 Answers

9

This TensorFlow call

from tensorflow.examples.tutorials.mnist import input_data
input_data.read_data_sets('my/directory')

... won't download anything if you already have the files there.
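As a quick sanity check before calling read_data_sets, you can confirm that the four archives really are in the directory. The sketch below is only an illustration: the helper name missing_mnist_files is made up for this example, and the file names are assumed to be the standard ones from the MNIST site (the same ones used in the code further down in this answer). Note also that tensorflow.examples.tutorials is only available in TensorFlow 1.x.

```python
import os

# Standard MNIST archive names (assumed to match what the loader looks for).
MNIST_FILES = [
    'train-images-idx3-ubyte.gz',
    'train-labels-idx1-ubyte.gz',
    't10k-images-idx3-ubyte.gz',
    't10k-labels-idx1-ubyte.gz',
]

def missing_mnist_files(directory):
    """Return the archive names that are not present in `directory`."""
    return [name for name in MNIST_FILES
            if not os.path.isfile(os.path.join(directory, name))]
```

An empty list from missing_mnist_files('my/directory') means all four archives are in place.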

But if for some reason you wish to unzip it yourself, here's how you do it:

from tensorflow.contrib.learn.python.learn.datasets.mnist import extract_images, extract_labels

with open('my/directory/train-images-idx3-ubyte.gz', 'rb') as f:
  train_images = extract_images(f)
with open('my/directory/train-labels-idx1-ubyte.gz', 'rb') as f:
  train_labels = extract_labels(f)

with open('my/directory/t10k-images-idx3-ubyte.gz', 'rb') as f:
  test_images = extract_images(f)
with open('my/directory/t10k-labels-idx1-ubyte.gz', 'rb') as f:
  test_labels = extract_labels(f)
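Since tf.contrib was removed in TensorFlow 2.x, the same files can also be read without TensorFlow. The following is a minimal sketch using only gzip, struct and NumPy; read_idx_gz is a hypothetical helper name, and the header layout (two zero bytes, a type code, the number of dimensions, then one big-endian 32-bit size per dimension) is the IDX format described on the MNIST site. Unlike extract_images, it returns images with shape (N, 28, 28), without a trailing channel axis.

```python
import gzip
import struct
import numpy as np

def read_idx_gz(path):
    """Read a gzip-compressed IDX file of unsigned bytes into a NumPy array."""
    with gzip.open(path, 'rb') as f:
        # Magic number: two zero bytes, type code (0x08 = unsigned byte),
        # and the number of dimensions.
        zero, dtype_code, ndim = struct.unpack('>HBB', f.read(4))
        if zero != 0 or dtype_code != 0x08:
            raise ValueError('not an unsigned-byte IDX file: %s' % path)
        # One big-endian 32-bit size per dimension.
        shape = struct.unpack('>' + 'I' * ndim, f.read(4 * ndim))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)
```

For example, read_idx_gz('my/directory/train-labels-idx1-ubyte.gz') would give a one-dimensional array of digit labels.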
binke ou
Maxim
  • If you have some time just look at these questions [https://stackoverflow.com/questions/64085547/mnist-datasets-from-google-drive-folder-showing-datasets-not-found] and [https://stackoverflow.com/questions/64080130/how-to-load-training-data-including-label-data-ubyte-format-of-images-from-loc] – mostafiz67 Sep 27 '20 at 08:20
9

If you have the MNIST data extracted, then you can load it low-level with NumPy directly:

import os
import numpy as np

def loadMNIST( prefix, folder ):
    # IDX headers store big-endian 32-bit integers
    intType = np.dtype( 'int32' ).newbyteorder( '>' )
    nMetaDataBytes = 4 * intType.itemsize

    # image file header: magic number, image count, row count, column count
    data = np.fromfile( os.path.join( folder, prefix + '-images-idx3-ubyte' ), dtype = 'ubyte' )
    magicBytes, nImages, nRows, nCols = np.frombuffer( data[:nMetaDataBytes].tobytes(), intType )
    data = data[nMetaDataBytes:].astype( dtype = 'float32' ).reshape( [ nImages, nRows, nCols ] )

    # label file header: magic number and label count (2 integers), then one byte per label
    labels = np.fromfile( os.path.join( folder, prefix + '-labels-idx1-ubyte' ),
                          dtype = 'ubyte' )[2 * intType.itemsize:]

    return data, labels

trainingImages, trainingLabels = loadMNIST( "train", "../datasets/mnist/" )
testImages, testLabels = loadMNIST( "t10k", "../datasets/mnist/" )
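The loader above expects the uncompressed files. If you only have the .gz archives, a minimal sketch for extracting them with the standard library could look like this (gunzip is a hypothetical helper name, and the default output path simply drops the .gz suffix):

```python
import gzip
import os
import shutil

def gunzip(src_path, dst_path=None):
    """Decompress `src_path` (a .gz file) and return the output path."""
    if dst_path is None:
        # Default: drop the '.gz' suffix, e.g. 'x-ubyte.gz' -> 'x-ubyte'.
        dst_path, ext = os.path.splitext(src_path)
        if ext != '.gz':
            raise ValueError('expected a .gz file: %s' % src_path)
    # Stream the decompressed bytes so the whole file is never held in memory.
    with gzip.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return dst_path
```

Calling gunzip on each of the four archives once leaves the raw -ubyte files next to them, after which loadMNIST can read them.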

And to convert the labels to one-hot encoding:

def toHotEncoding( classification ):
    # emulates the functionality of tf.keras.utils.to_categorical( y )
    hotEncoding = np.zeros( [ len( classification ), 
                              np.max( classification ) + 1 ] )
    hotEncoding[ np.arange( len( hotEncoding ) ), classification ] = 1
    return hotEncoding

trainingLabels = toHotEncoding( trainingLabels )
testLabels = toHotEncoding( testLabels )
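An alternative sketch of the same conversion indexes an identity matrix with the labels. The name one_hot and the n_classes parameter are introduced here for illustration; fixing the number of classes to 10 is an assumption that keeps the encoding width stable even if a subset of the labels happens not to contain the digit 9.

```python
import numpy as np

def one_hot(labels, n_classes=10):
    """Return a (len(labels), n_classes) float array with a single 1 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(n_classes)[labels]
```

For instance, one_hot([0, 2, 1], n_classes=3) yields the rows [1, 0, 0], [0, 0, 1] and [0, 1, 0].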
mxmlnkn
4

I will show how to load it from scratch (for better understanding) and how to display a digit image with matplotlib.pyplot.

import pickle
import gzip
import numpy as np
import matplotlib.pyplot as plt

def load_data():
    path = '../../data/mnist.pkl.gz'
    # encoding='latin1' is needed on Python 3 when the file was pickled by Python 2
    with gzip.open(path, 'rb') as f:
        training_data, validation_data, test_data = pickle.load(f, encoding='latin1')

    X_train, y_train = training_data[0], training_data[1]
    print(X_train.shape, y_train.shape)
    # (50000, 784) (50000,)

    # get the first image and its label
    img1_arr, img1_label = X_train[0], y_train[0]
    print(img1_arr.shape, img1_label)
    # (784,) 5

    # reshape the first image (1D vector) to a 2D image
    img1_2d = np.reshape(img1_arr, (28, 28))
    # show it
    plt.subplot(111)
    plt.imshow(img1_2d, cmap=plt.get_cmap('gray'))
    plt.show()

You can also convert a label to a 10-dimensional unit vector with this sample function:

def vectorized_result(label):
    e = np.zeros((10, 1))
    e[label] = 1.0
    return e

Vectorize the label of the first image:

print(vectorized_result(5))  # 5 is the label of the first training image
# output as below:
[[0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]]

If you want to turn it into CNN input, you can reshape it like this:

def load_data_v2():
    path = '../../data/mnist.pkl.gz'
    with gzip.open(path, 'rb') as f:
        training_data, validation_data, test_data = pickle.load(f, encoding='latin1')

    X_train, y_train = training_data[0], training_data[1]
    print(X_train.shape, y_train.shape)
    # (50000, 784) (50000,)

    # one 28x28 image per row, one (10, 1) unit vector per label
    X_train = np.array([np.reshape(item, (28, 28)) for item in X_train])
    y_train = np.array([vectorized_result(item) for item in y_train])

    print(X_train.shape, y_train.shape)
    # (50000, 28, 28) (50000, 10, 1)
    return X_train, y_train
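
The per-item reshaping above can also be done in one vectorized step. The sketch below is an illustration under a few assumptions: to_cnn_input is a made-up name, the images are laid out channels-last as (N, 28, 28, 1), and the labels are returned as flat one-hot rows of shape (N, 10) rather than the (N, 10, 1) column vectors used above.

```python
import numpy as np

def to_cnn_input(X_flat, y, n_classes=10):
    """Reshape flat 784-pixel rows to (N, 28, 28, 1) and labels to one-hot (N, n_classes)."""
    # -1 lets NumPy infer the number of images from the array size.
    X = np.asarray(X_flat, dtype=np.float32).reshape(-1, 28, 28, 1)
    # Indexing the identity matrix by label picks one unit row per sample.
    Y = np.eye(n_classes, dtype=np.float32)[np.asarray(y, dtype=np.int64)]
    return X, Y
```

Whether the channel axis should come last or first depends on the framework you feed the arrays to, so adjust the reshape accordingly.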
Jayhello