58

I have created a simple function for face recognition using the FaceRecognizer from OpenCV. It works fine with images of people.

Now I would like to run a test using handwritten characters instead of people. I came across the MNIST dataset, but it stores the images in a file format I have never seen before.

I simply need to extract a few images from:

train-images.idx3-ubyte

and save them in a folder as .gif

Or am I misunderstanding what MNIST is? If so, where could I get such a dataset?

EDIT

I also have the gzip file:

train-images-idx3-ubyte.gz

I am trying to read the content, but show() does not work and if I read() I see random symbols.

images = gzip.open("train-images-idx3-ubyte.gz", 'rb')
print images.read()

EDIT

I managed to get some useful output by using:

with gzip.open('train-images-idx3-ubyte.gz','r') as fin:
    for line in fin:
        print('got line', line)

Somehow I now have to convert this output to an image:

[screenshot of the printed output]

mzakaria
Roman

8 Answers

78

Download the training/test images and labels:

  • train-images-idx3-ubyte.gz: training set images
  • train-labels-idx1-ubyte.gz: training set labels
  • t10k-images-idx3-ubyte.gz: test set images
  • t10k-labels-idx1-ubyte.gz: test set labels

And uncompress them in a workdir, say samples/.

Get the python-mnist package from PyPI:

pip install python-mnist

Import the mnist package and read the training/test images:

from mnist import MNIST

mndata = MNIST('samples')

images, labels = mndata.load_training()
# or
images, labels = mndata.load_testing()

To display an image to the console:

import random

index = random.randrange(0, len(images))  # choose an index ;-)
print(mndata.display(images[index]))

You'll get something like this:

............................
............................
............................
............................
............................
.................@@.........
..............@@@@@.........
............@@@@............
..........@@................
..........@.................
...........@................
...........@................
...........@...@............
...........@@@@@.@..........
...........@@@...@@.........
...........@@.....@.........
..................@.........
..................@@........
..................@@........
..................@.........
.................@@.........
...........@.....@..........
...........@....@@..........
............@@@@............
.............@..............
............................
............................
............................

Explanation:

  • Each image in the images list is a Python list of unsigned bytes.
  • labels is a Python array of unsigned bytes.
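
Assuming each image is a flat list of 784 values (28 rows of 28 pixels, stored row by row), turning it back into a picture is a matter of slicing it into rows. The sketch below illustrates that idea only; to_rows and render_ascii are hypothetical helpers, and the threshold of 128 is an assumption rather than the value python-mnist uses internally.

```python
def to_rows(pixels, width=28):
    """Split a flat list of pixel values into rows of `width` pixels."""
    if len(pixels) % width != 0:
        raise ValueError("pixel count is not a multiple of the row width")
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


def render_ascii(pixels, width=28, threshold=128):
    """Draw '@' for pixels at or above `threshold` and '.' for the rest."""
    return "\n".join(
        "".join("@" if value >= threshold else "." for value in row)
        for row in to_rows(pixels, width)
    )


# Tiny 2x3 example instead of a real 28x28 digit.
print(render_ascii([0, 255, 0, 200, 10, 130], width=3))
```

With the real data, render_ascii(images[index]) should give a picture similar to the one above.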
Laurent LAPORTE
  • 33
    note that when you extract the files, rename the dots to `-` (or you will get a file missing error), for example `t10k-images.idx3-ubyte` must be renamed to `t10k-images-idx3-ubyte` – Abdelouahab Jul 23 '17 at 17:49
58

(Using only matplotlib, gzip and numpy)
Extract image data:

import gzip
import numpy as np

image_size = 28
num_images = 5

f = gzip.open('train-images-idx3-ubyte.gz', 'rb')
f.read(16)  # skip the 16-byte header (magic number, image count, rows, columns)
buf = f.read(image_size * image_size * num_images)
data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
data = data.reshape(num_images, image_size, image_size, 1)

Print images:

import matplotlib.pyplot as plt
image = np.asarray(data[2]).squeeze()
plt.imshow(image)
plt.show()


Print first 50 labels:

f = gzip.open('train-labels-idx1-ubyte.gz', 'rb')
f.read(8)  # skip the 8-byte header (magic number, label count)
for i in range(50):
    buf = f.read(1)
    labels = np.frombuffer(buf, dtype=np.uint8).astype(np.int64)
    print(labels)
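
The f.read(16) and f.read(8) calls skip the IDX header. As a rough sketch of what those bytes hold, based on the format description on the MNIST page: two zero bytes, one byte for the data type, one byte for the number of dimensions, then one big-endian 32-bit size per dimension. parse_idx_header is a hypothetical helper, not part of any library, and the header below is built by hand rather than read from a real file.

```python
import struct


def parse_idx_header(data):
    """Return (type_code, dims, header_length) for the start of an IDX file."""
    zero, type_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0:
        raise ValueError("not an IDX file: the first two bytes must be 0")
    # One big-endian unsigned 32-bit integer per dimension.
    dims = struct.unpack(">" + "I" * ndim, data[4:4 + 4 * ndim])
    return type_code, dims, 4 + 4 * ndim


# Synthetic image-file header: type 0x08 (unsigned byte), 3 dimensions.
header = struct.pack(">HBBIII", 0, 0x08, 3, 60000, 28, 28)
print(parse_idx_header(header))
```

An image file has three dimensions and therefore a 16-byte header; a label file has one dimension and an 8-byte header, which is where the two read sizes come from.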
Punnerud
  • 7
    Is the f.read(16) and f.read(8) skipping non-image information? – DuttaA Mar 14 '19 at 13:35
  • Rewritten now for easier understanding. Yes, f.read(16) and f.read(8) skip the header; the first two bytes of each file are always 0. Read more about the IDX (MNIST) format here: http://yann.lecun.com/exdb/mnist/ – Punnerud Apr 06 '19 at 11:45
  • But you wrote 100 labels, but changed it to 50? – DuttaA Apr 06 '19 at 11:45
  • Thanks, fixed. I felt it didn't add extra value to display a lot of data on the screen when it was only stacked vertically. It had its purpose when it was stacked both horizontally and vertically. – Punnerud Apr 06 '19 at 11:49
  • Hi, if I want to show images from `train-labels-idx1-ubyte` (already without .gz), what do I have to do? – mostafiz67 Sep 26 '20 at 18:50
  • 1
    @mostafiz67 Hi, you can do `f = open('train-labels-idx1-ubyte', 'rb')`. That way you will be just opening the file with python's open function in binary mode. – gonzarodriguezt Nov 09 '20 at 00:31
19

You could actually use the idx2numpy package available at PyPI. It's extremely simple to use and directly converts the data to numpy arrays. Here's what you have to do:

Downloading the data

Download the MNIST dataset from the official website.
If you're using Linux then you can use wget to get it from command line itself. Just run:

wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

Decompressing the data

Unzip or decompress the data. On Linux, you could use gzip.

Ultimately, you should have the following files:

data/train-images-idx3-ubyte
data/train-labels-idx1-ubyte
data/t10k-images-idx3-ubyte
data/t10k-labels-idx1-ubyte

The prefix data/ is there only because I've extracted them into a folder named data. From your question, it looks like you have already got this far, so keep reading.
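
Where the gzip command is not available (for example on Windows), the same decompression step can be done with Python's standard library. This is a sketch rather than part of the original workflow; decompress_all is a hypothetical helper, and the data folder name simply follows the listing above.

```python
import gzip
import shutil
from pathlib import Path


def decompress_all(folder):
    """Decompress every *.gz file in `folder`, dropping the .gz suffix."""
    written = []
    for gz_path in sorted(Path(folder).glob("*.gz")):
        # train-images-idx3-ubyte.gz -> train-images-idx3-ubyte
        target = gz_path.with_suffix("")
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(target)
    return written
```

Calling decompress_all("data") after the downloads should leave the four uncompressed files next to the .gz archives, with the hyphenated names shown above.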

Using idx2numpy

Here's some simple Python code that reads everything from the decompressed files into NumPy arrays.

import idx2numpy
import numpy as np
file = 'data/train-images-idx3-ubyte'
arr = idx2numpy.convert_from_file(file)
# arr is now an np.ndarray of shape (60000, 28, 28)

You can now use it with OpenCV just as you would display any other image. Assuming cv refers to the cv2 module and the call is followed by cv.waitKey(0) so the window stays open, use something like

cv.imshow("Image", arr[4])

To install idx2numpy, you can use pip, which downloads it from PyPI. Simply run the command:

pip install idx2numpy
Avneesh Mishra
  • 1
    any way to get separated images and not mixed? – Vicrobot Jul 23 '19 at 09:57
  • 1
    Nice end-to-end tutorial. This utility works not only with the Digits mnist, but also with Fashion mnist, too (found here -- https://github.com/zalandoresearch/fashion-mnist); or any other idx formatted file. – NYCeyes Oct 17 '19 at 13:25
16

Install idx2numpy

pip install idx2numpy

Downloading the data

Download the MNIST dataset from the official website.

Decompressing the data

Ultimately, you should have the following files:

train-images-idx3-ubyte
train-labels-idx1-ubyte
t10k-images-idx3-ubyte
t10k-labels-idx1-ubyte

Using idx2numpy

import numpy as np
import idx2numpy
import matplotlib.pyplot as plt

imagefile = 'train-images-idx3-ubyte'  # same name as in the file list above
imagearray = idx2numpy.convert_from_file(imagefile)

plt.imshow(imagearray[4], cmap=plt.cm.binary)
plt.show()


Wasi Master
ho_khalaf
15
import gzip
import numpy as np


def training_images():
    with gzip.open('data/train-images-idx3-ubyte.gz', 'r') as f:
        # first 4 bytes is a magic number
        magic_number = int.from_bytes(f.read(4), 'big')
        # second 4 bytes is the number of images
        image_count = int.from_bytes(f.read(4), 'big')
        # third 4 bytes is the row count
        row_count = int.from_bytes(f.read(4), 'big')
        # fourth 4 bytes is the column count
        column_count = int.from_bytes(f.read(4), 'big')
        # rest is the image pixel data, each pixel is stored as an unsigned byte
        # pixel values are 0 to 255
        image_data = f.read()
        images = np.frombuffer(image_data, dtype=np.uint8)\
            .reshape((image_count, row_count, column_count))
        return images


def training_labels():
    with gzip.open('data/train-labels-idx1-ubyte.gz', 'r') as f:
        # first 4 bytes is a magic number
        magic_number = int.from_bytes(f.read(4), 'big')
        # second 4 bytes is the number of labels
        label_count = int.from_bytes(f.read(4), 'big')
        # rest is the label data, each label is stored as unsigned byte
        # label values are 0 to 9
        label_data = f.read()
        labels = np.frombuffer(label_data, dtype=np.uint8)
        return labels
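
To get from these arrays to image files in a folder, which is what the question asks for, each 28x28 array has to be written out in some image format. The sketch below writes binary PGM files only because that needs nothing beyond the standard library; save_pgm and export_images are hypothetical helpers and the file naming scheme is an assumption. With Pillow installed, Image.fromarray(img).save('0.gif') would be the analogous call for GIF output.

```python
import os


def save_pgm(rows, filename):
    """Write one grayscale image (a sequence of rows of 0-255 values) as binary PGM."""
    height = len(rows)
    width = len(rows[0])
    pixels = bytes(int(value) for row in rows for value in row)
    with open(filename, "wb") as f:
        f.write(b"P5\n%d %d\n255\n" % (width, height))
        f.write(pixels)


def export_images(images, labels, folder, count=5):
    """Save the first `count` images as <index>_<label>.pgm inside `folder`."""
    os.makedirs(folder, exist_ok=True)
    paths = []
    for index in range(min(count, len(images))):
        name = "%d_%d.pgm" % (index, int(labels[index]))
        path = os.path.join(folder, name)
        save_pgm(images[index], path)
        paths.append(path)
    return paths
```

For example, export_images(training_images(), training_labels(), 'out') would write the first five training digits into a folder named out.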
UdaraWanasinghe
1

Here is a ready-made function for you. It loads the images in binarized form, i.e. every pixel is 0 or 1.

import gzip
import os

import numpy as np
import requests


def load_mnist(train_data=True, test_data=False):
    """
    Get mnist data from the official website and
    load them in binary format.

    Parameters
    ----------
    train_data : bool
        Loads
        'train-images-idx3-ubyte.gz'
        'train-labels-idx1-ubyte.gz'
    test_data : bool
        Loads
        't10k-images-idx3-ubyte.gz'
        't10k-labels-idx1-ubyte.gz' 

    Return
    ------
    tuple
    tuple[0] are images (train & test)
    tuple[1] are labels (train & test)

    """
    RESOURCES = [
        'train-images-idx3-ubyte.gz',
        'train-labels-idx1-ubyte.gz',
        't10k-images-idx3-ubyte.gz',
        't10k-labels-idx1-ubyte.gz']

    os.makedirs('data/mnist', exist_ok=True)
    for name in RESOURCES:
        if not os.path.isfile('data/mnist/' + name):
            url = 'http://yann.lecun.com/exdb/mnist/' + name
            r = requests.get(url, allow_redirects=True)
            r.raise_for_status()  # fail loudly instead of saving an error page
            with open('data/mnist/' + name, 'wb') as f:
                f.write(r.content)

    return get_images(train_data, test_data), get_labels(train_data, test_data)


def get_images(train_data=True, test_data=False):

    to_return = []

    if train_data:
        with gzip.open('data/mnist/train-images-idx3-ubyte.gz', 'r') as f:
            # first 4 bytes is a magic number
            magic_number = int.from_bytes(f.read(4), 'big')
            # second 4 bytes is the number of images
            image_count = int.from_bytes(f.read(4), 'big')
            # third 4 bytes is the row count
            row_count = int.from_bytes(f.read(4), 'big')
            # fourth 4 bytes is the column count
            column_count = int.from_bytes(f.read(4), 'big')
            # rest is the image pixel data, each pixel is stored as an unsigned byte
            # pixel values are 0 to 255
            image_data = f.read()
            train_images = np.frombuffer(image_data, dtype=np.uint8)\
                .reshape((image_count, row_count, column_count))
            to_return.append(np.where(train_images > 127, 1, 0))

    if test_data:
        with gzip.open('data/mnist/t10k-images-idx3-ubyte.gz', 'r') as f:
            # first 4 bytes is a magic number
            magic_number = int.from_bytes(f.read(4), 'big')
            # second 4 bytes is the number of images
            image_count = int.from_bytes(f.read(4), 'big')
            # third 4 bytes is the row count
            row_count = int.from_bytes(f.read(4), 'big')
            # fourth 4 bytes is the column count
            column_count = int.from_bytes(f.read(4), 'big')
            # rest is the image pixel data, each pixel is stored as an unsigned byte
            # pixel values are 0 to 255
            image_data = f.read()
            test_images = np.frombuffer(image_data, dtype=np.uint8)\
                .reshape((image_count, row_count, column_count))
            to_return.append(np.where(test_images > 127, 1, 0))

    return to_return


def get_labels(train_data=True, test_data=False):

    to_return = []

    if train_data:
        with gzip.open('data/mnist/train-labels-idx1-ubyte.gz', 'r') as f:
            # first 4 bytes is a magic number
            magic_number = int.from_bytes(f.read(4), 'big')
            # second 4 bytes is the number of labels
            label_count = int.from_bytes(f.read(4), 'big')
            # rest is the label data, each label is stored as unsigned byte
            # label values are 0 to 9
            label_data = f.read()
            train_labels = np.frombuffer(label_data, dtype=np.uint8)
            to_return.append(train_labels)
    if test_data:
        with gzip.open('data/mnist/t10k-labels-idx1-ubyte.gz', 'r') as f:
            # first 4 bytes is a magic number
            magic_number = int.from_bytes(f.read(4), 'big')
            # second 4 bytes is the number of labels
            label_count = int.from_bytes(f.read(4), 'big')
            # rest is the label data, each label is stored as unsigned byte
            # label values are 0 to 9
            label_data = f.read()
            test_labels = np.frombuffer(label_data, dtype=np.uint8)
            to_return.append(test_labels)

    return to_return
gouzmi
0

Mass convert to PNG files

https://github.com/myleott/mnist_png/blob/400fe88faba05ae79bbc2107071144e6f1ea2720/convert_mnist_to_png.py contains a good PNG extraction example, licensed under GPL 2.0. Should be easy to adapt to other output formats with a library like Pillow.

They also have a pre-extracted archive at: https://github.com/myleott/mnist_png/blob/master/mnist_png.tar.gz?raw=true

Usage:

wget \
 http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz \
 http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz \
 http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz \
 http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
gunzip --keep *-ubyte.gz
python3 -m pip install pypng==0.20220715.0
./convert_mnist_to_png.py . out

And now out/ contains files such as:

out/training/0/1.png

out/training/0/21.png

out/training/1/3.png

out/training/1/6.png

out/testing/0/10.png

out/testing/0/13.png

convert_mnist_to_png.py

#!/usr/bin/env python

import os
import struct
import sys

from array import array
from os import path

import png

# source: http://abel.ee.ucla.edu/cvxopt/_downloads/mnist.py
def read(dataset="training", path="."):
    # compare strings with ==; "is" tests object identity and is unreliable here
    if dataset == "training":
        fname_img = os.path.join(path, 'train-images-idx3-ubyte')
        fname_lbl = os.path.join(path, 'train-labels-idx1-ubyte')
    elif dataset == "testing":
        fname_img = os.path.join(path, 't10k-images-idx3-ubyte')
        fname_lbl = os.path.join(path, 't10k-labels-idx1-ubyte')
    else:
        raise ValueError("dataset must be 'testing' or 'training'")

    flbl = open(fname_lbl, 'rb')
    magic_nr, size = struct.unpack(">II", flbl.read(8))
    lbl = array("b", flbl.read())
    flbl.close()

    fimg = open(fname_img, 'rb')
    magic_nr, size, rows, cols = struct.unpack(">IIII", fimg.read(16))
    img = array("B", fimg.read())
    fimg.close()

    return lbl, img, size, rows, cols

def write_dataset(labels, data, size, rows, cols, output_dir):
    # create output directories
    output_dirs = [
        path.join(output_dir, str(i))
        for i in range(10)
    ]
    for dir in output_dirs:
        if not path.exists(dir):
            os.makedirs(dir)

    # write data
    for (i, label) in enumerate(labels):
        output_filename = path.join(output_dirs[label], str(i) + ".png")
        print("writing " + output_filename)
        with open(output_filename, "wb") as h:
            w = png.Writer(cols, rows, greyscale=True)
            data_i = [
                data[ (i*rows*cols + j*cols) : (i*rows*cols + (j+1)*cols) ]
                for j in range(rows)
            ]
            w.write(h, data_i)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: {0} <input_path> <output_path>".format(sys.argv[0]))
        sys.exit()

    input_path = sys.argv[1]
    output_path = sys.argv[2]

    for dataset in ["training", "testing"]:
        labels, data, size, rows, cols = read(dataset, input_path)
        write_dataset(labels, data, size, rows, cols,
                      path.join(output_path, dataset))

Inspecting the generated PNGs with:

identify out/testing/0/10.png

gives:

out/testing/0/10.png PNG 28x28 28x28+0+0 8-bit Gray 256c 272B 0.000u 0:00.000

so they appear to be 8-bit grayscale, and therefore should faithfully represent the original data.

Tested on Ubuntu 22.10.

Ciro Santilli OurBigBook.com
-4

I had the same issue.

When I unzipped the files, the .gz extension was not removed, so I had:

train-images-idx3-ubyte.gz

Once I removed the .gz extension, I had:

train-images-idx3-ubyte

This fixed my issue.

Trey