Extract images from .idx3-ubyte file or GZIP via Python

Question

I have written a simple face recognition function using the FaceRecognizer from OpenCV. It works fine with images of people.

Now I would like to test it with handwritten characters instead of people. I came across the MNIST dataset, but it stores the images in a file format I have never seen before.

I simply need to extract a few images from:

train-images.idx3-ubyte

and save them in a folder as .gif

Or am I misunderstanding how MNIST works? If so, where could I get such a dataset?

EDIT

I also have the gzip file:

train-images-idx3-ubyte.gz

I am trying to read the content, but show() does not work and if I read() I see random symbols.

images = gzip.open("train-images-idx3-ubyte.gz", 'rb')
print(images.read())

EDIT

Managed to get some useful output by using:

with gzip.open('train-images-idx3-ubyte.gz','r') as fin:
    for line in fin:
        print('got line', line)

Somehow I now have to convert this output to an image.

The `python-mnist` package on PyPI has some code that can do the job. — Kh40tiK, Nov 04 '16 at 17:08
The file format of `.idx3-ubyte` is described in [THE MNIST DATABASE](http://yann.lecun.com/exdb/mnist/) page. — Laurent LAPORTE, Nov 04 '16 at 18:57
If anyone is wondering where to find these datasets, here is the link: http://yann.lecun.com/exdb/mnist/ — Rohit Singh, Oct 07 '20 at 06:45

Laurent LAPORTE · Answer 1 · 2019-02-25T12:22:43.097

Download the training/test images and labels:

train-images-idx3-ubyte.gz: training set images
train-labels-idx1-ubyte.gz: training set labels
t10k-images-idx3-ubyte.gz: test set images
t10k-labels-idx1-ubyte.gz: test set labels

And uncompress them in a workdir, say samples/.

Get the python-mnist package from PyPI:

pip install python-mnist

Import the mnist package and read the training/test images:

from mnist import MNIST

mndata = MNIST('samples')

images, labels = mndata.load_training()
# or
images, labels = mndata.load_testing()

To display an image to the console:

import random

index = random.randrange(0, len(images))  # choose an index ;-)
print(mndata.display(images[index]))

You'll get something like this:

............................
............................
............................
............................
............................
.................@@.........
..............@@@@@.........
............@@@@............
..........@@................
..........@.................
...........@................
...........@................
...........@...@............
...........@@@@@.@..........
...........@@@...@@.........
...........@@.....@.........
..................@.........
..................@@........
..................@@........
..................@.........
.................@@.........
...........@.....@..........
...........@....@@..........
............@@@@............
.............@..............
............................
............................
............................

Explanation:

Each image in the images list is a Python list of unsigned bytes.
labels is a Python array of unsigned bytes.
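
Assuming each image arrives as a flat list of 28 × 28 = 784 values, it has to be cut into rows before most image code can use it. A minimal sketch in plain Python; `to_rows` and `render_ascii` are hypothetical helper names, not part of python-mnist, and the threshold of 128 is an arbitrary choice:

```python
def to_rows(flat, width=28):
    """Split a flat pixel list into rows of `width` pixels each."""
    if len(flat) % width != 0:
        raise ValueError("pixel count is not a multiple of the row width")
    return [list(flat[i:i + width]) for i in range(0, len(flat), width)]


def render_ascii(flat, width=28, threshold=128):
    """Draw pixels above `threshold` as '@' and all others as '.'."""
    lines = []
    for row in to_rows(flat, width):
        lines.append("".join("@" if value > threshold else "." for value in row))
    return "\n".join(lines)
```

Something like `print(render_ascii(images[index]))` should then give a picture similar to the one above.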

Note that after extracting the files you may have to replace the dots in the file names with `-` (or you will get a file-missing error); for example, `t10k-images.idx3-ubyte` must be renamed to `t10k-images-idx3-ubyte`. — Abdelouahab, Jul 23 '17 at 17:49

Punnerud · Answer 2 · 2019-04-06T11:46:13.380

58

(Using only matplotlib, gzip and numpy)
Extract image data:

import gzip
import numpy as np

image_size = 28
num_images = 5

f = gzip.open('train-images-idx3-ubyte.gz', 'rb')
f.read(16)  # skip the 16-byte header (magic number, image count, rows, columns)
buf = f.read(image_size * image_size * num_images)
data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
data = data.reshape(num_images, image_size, image_size, 1)

Print images:

import matplotlib.pyplot as plt
image = np.asarray(data[2]).squeeze()
plt.imshow(image)
plt.show()

Print first 50 labels:

f = gzip.open('train-labels-idx1-ubyte.gz', 'rb')
f.read(8)  # skip the 8-byte header (magic number, label count)
for i in range(50):
    buf = f.read(1)
    labels = np.frombuffer(buf, dtype=np.uint8).astype(np.int64)
    print(labels)
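
The 16 and 8 bytes skipped above are the IDX header. According to the format description on the MNIST page, it starts with a 4-byte magic number (two zero bytes, a data-type code, and the number of dimensions), followed by one big-endian 32-bit size per dimension. A sketch that decodes the header instead of skipping it; `parse_idx_header` is a hypothetical helper name:

```python
import struct


def parse_idx_header(raw):
    """Return (type_code, dims, offset) for the start of an IDX file.

    `raw` must hold at least the header bytes; `offset` is the position
    at which the actual data begins.
    """
    zero1, zero2, type_code, ndim = struct.unpack(">BBBB", raw[:4])
    if zero1 != 0 or zero2 != 0:
        raise ValueError("not an IDX file: the first two bytes must be 0")
    offset = 4 + 4 * ndim
    # One unsigned big-endian 32-bit integer per dimension.
    dims = struct.unpack(">" + "I" * ndim, raw[4:offset])
    return type_code, dims, offset
```

For an image file this should report type code 8 (unsigned byte), three dimensions and an offset of 16; for a label file, one dimension and an offset of 8, which matches the two `f.read` calls.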

edited Apr 06 '19 at 11:46

answered Dec 01 '18 at 12:18

Punnerud

Is the f.read(16) and f.read(8) skipping non-image information? – DuttaA Mar 14 '19 at 13:35
Rewritten now for easier understanding. Yes, f.read(16) and f.read(8) skip the header; the first two bytes of the file are always 0. Read more about the IDX (MNIST) format here: http://yann.lecun.com/exdb/mnist/ – Punnerud Apr 06 '19 at 11:45
But you wrote 100 labels, but changed it to 50? – DuttaA Apr 06 '19 at 11:45
Thanks, fixed. I felt it didn't add value to display a lot of data on screen when it was only stacked vertically. It had its purpose when it was stacked both horizontally and vertically. – Punnerud Apr 06 '19 at 11:49
Hi, if I want to show images from `train-labels-idx1-ubyte` (already without .gz), what do I have to do? – mostafiz67 Sep 26 '20 at 18:50
@mostafiz67 Hi, you can do `f = open('train-labels-idx1-ubyte', 'rb')`. That way you will be just opening the file with python's open function in binary mode. – gonzarodriguezt Nov 09 '20 at 00:31

score 19 · Answer 3 · answered Apr 03 '19 at 21:29

You could actually use the idx2numpy package available at PyPI. It's extremely simple to use and directly converts the data to numpy arrays. Here's what you have to do:

Downloading the data

Download the MNIST dataset from the official website.
If you're using Linux then you can use wget to get it from command line itself. Just run:

wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

Decompressing the data

Unzip or decompress the data. On Linux, you could use gzip -d (or gunzip).

Ultimately, you should have the following files:

data/train-images-idx3-ubyte
data/train-labels-idx1-ubyte
data/t10k-images-idx3-ubyte
data/t10k-labels-idx1-ubyte

The prefix data/ is only there because I extracted the files into a folder named data. From your question it looks like you have already got this far, so keep reading.

Using idx2numpy

Here's some simple Python code that reads everything from a decompressed file into a NumPy array.

import idx2numpy
import numpy as np
file = 'data/train-images-idx3-ubyte'
arr = idx2numpy.convert_from_file(file)
# arr is now an np.ndarray of shape (60000, 28, 28)

You can now use it with OpenCV just the same way you would display any other image, using something like the following (add cv.waitKey(0) afterwards so the window stays open):

import cv2 as cv
cv.imshow("Image", arr[4])
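
To pull out only a few samples per digit, the image array can be indexed with the label array. A sketch with NumPy, assuming `images` has shape (N, 28, 28) and `labels` has shape (N,), with the labels loaded the same way from the label file; `first_n_per_label` is a hypothetical helper name:

```python
import numpy as np


def first_n_per_label(images, labels, n=3):
    """Map each label to an array holding its first `n` images."""
    selected = {}
    for label in np.unique(labels):
        # Indices of all samples with this label, in file order.
        indices = np.flatnonzero(labels == label)[:n]
        selected[int(label)] = images[indices]
    return selected
```

The result for digit 7, for example, would then be available as `first_n_per_label(arr, labels)[7]`.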

To install idx2numpy from PyPI, use pip:

pip install idx2numpy

Nice end-to-end tutorial. This utility works not only with the MNIST digits but also with Fashion-MNIST (found here: https://github.com/zalandoresearch/fashion-mnist), or any other IDX-formatted file. — NYCeyes, Oct 17 '19 at 13:25

score 16 · Answer 4 · edited Sep 07 '21 at 20:44

Installing idx2numpy

pip install idx2numpy

Downloading the data

Download the MNIST dataset from the official website.

Decompressing the data

Ultimately, you should have the following files:

train-images-idx3-ubyte
train-labels-idx1-ubyte
t10k-images-idx3-ubyte
t10k-labels-idx1-ubyte

Using idx2numpy

import idx2numpy
import matplotlib.pyplot as plt

# The name must match the extracted file listed above
imagefile = 'train-images-idx3-ubyte'
imagearray = idx2numpy.convert_from_file(imagefile)

plt.imshow(imagearray[4], cmap=plt.cm.binary)
plt.show()

edited Sep 07 '21 at 20:44

Wasi Master

answered Apr 24 '20 at 06:58

ho_khalaf

Excellent. I tried most of the answers and only this one works perfectly. – Amir Pourmand May 25 '21 at 08:43

score 15 · Answer 5 · answered Jul 07 '20 at 18:07

import gzip
import numpy as np


def training_images():
    with gzip.open('data/train-images-idx3-ubyte.gz', 'r') as f:
        # first 4 bytes is a magic number
        magic_number = int.from_bytes(f.read(4), 'big')
        # second 4 bytes is the number of images
        image_count = int.from_bytes(f.read(4), 'big')
        # third 4 bytes is the row count
        row_count = int.from_bytes(f.read(4), 'big')
        # fourth 4 bytes is the column count
        column_count = int.from_bytes(f.read(4), 'big')
        # rest is the image pixel data, each pixel is stored as an unsigned byte
        # pixel values are 0 to 255
        image_data = f.read()
        images = np.frombuffer(image_data, dtype=np.uint8)\
            .reshape((image_count, row_count, column_count))
        return images


def training_labels():
    with gzip.open('data/train-labels-idx1-ubyte.gz', 'r') as f:
        # first 4 bytes is a magic number
        magic_number = int.from_bytes(f.read(4), 'big')
        # second 4 bytes is the number of labels
        label_count = int.from_bytes(f.read(4), 'big')
        # rest is the label data, each label is stored as unsigned byte
        # label values are 0 to 9
        label_data = f.read()
        labels = np.frombuffer(label_data, dtype=np.uint8)
        return labels
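
The question asks for .gif files. Given an array like the one returned by `training_images()`, a few images could be written out roughly as follows. This is a sketch that assumes the Pillow package is installed (it is not used in the answer above); `save_as_gif` and the output naming scheme are hypothetical:

```python
import os

import numpy as np
from PIL import Image  # Pillow; an assumption, not part of the answer above


def save_as_gif(images, out_dir, count=5):
    """Write the first `count` images of a (N, rows, cols) uint8 array as GIFs."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index in range(min(count, len(images))):
        # A 2-D uint8 array becomes an 8-bit greyscale image.
        pixels = np.ascontiguousarray(images[index], dtype=np.uint8)
        picture = Image.fromarray(pixels)
        path = os.path.join(out_dir, "{}.gif".format(index))
        picture.save(path)  # the format is chosen from the .gif extension
        paths.append(path)
    return paths
```

A call such as `save_as_gif(training_images(), 'gifs')` would then write `gifs/0.gif` to `gifs/4.gif`.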

'big' means big-endian, which defines the byte order: the most significant byte of the word is stored at the lowest memory address. — UdaraWanasinghe, Jul 15 '20 at 04:53
Can I change np.uint8 to np.float32? When I did this, the number of images changed to 15000, instead of 60000. — X.G, Sep 30 '22 at 00:49
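
On the last question: passing `np.float32` to `np.frombuffer` makes NumPy read every 4 bytes as one float, so the element count shrinks by a factor of four and the values are meaningless. A small sketch of the difference on made-up bytes, converting with `astype` only after reading the data as `np.uint8`:

```python
import numpy as np

raw = bytes(range(16))  # stand-in for 16 pixel bytes

# Wrong: reinterprets each group of 4 bytes as one 32-bit float.
wrong = np.frombuffer(raw, dtype=np.float32)

# Right: read one unsigned byte per pixel, then convert the values.
right = np.frombuffer(raw, dtype=np.uint8).astype(np.float32)

print(wrong.size, right.size)  # prints: 4 16
```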

score 1 · Answer 6 · answered Mar 04 '21 at 16:09

Here is a ready-made function for you! It loads the images in binary form, i.e. each pixel is 0 or 1.

import gzip
import os

import numpy as np
import requests


def load_mnist(train_data=True, test_data=False):
    """
    Get mnist data from the official website and
    load them in binary format.

    Parameters
    ----------
    train_data : bool
        Loads
        'train-images-idx3-ubyte.gz'
        'train-labels-idx1-ubyte.gz'
    test_data : bool
        Loads
        't10k-images-idx3-ubyte.gz'
        't10k-labels-idx1-ubyte.gz' 

    Return
    ------
    tuple
    tuple[0] are images (train & test)
    tuple[1] are labels (train & test)

    """
    RESOURCES = [
        'train-images-idx3-ubyte.gz',
        'train-labels-idx1-ubyte.gz',
        't10k-images-idx3-ubyte.gz',
        't10k-labels-idx1-ubyte.gz']

    os.makedirs('data/mnist', exist_ok=True)
    for name in RESOURCES:
        if not os.path.isfile('data/mnist/' + name):
            url = 'http://yann.lecun.com/exdb/mnist/' + name
            r = requests.get(url, allow_redirects=True)
            # Close the file handle once the download is written
            with open('data/mnist/' + name, 'wb') as out:
                out.write(r.content)

    return get_images(train_data, test_data), get_labels(train_data, test_data)


def get_images(train_data=True, test_data=False):

    to_return = []

    if train_data:
        with gzip.open('data/mnist/train-images-idx3-ubyte.gz', 'r') as f:
            # first 4 bytes is a magic number
            magic_number = int.from_bytes(f.read(4), 'big')
            # second 4 bytes is the number of images
            image_count = int.from_bytes(f.read(4), 'big')
            # third 4 bytes is the row count
            row_count = int.from_bytes(f.read(4), 'big')
            # fourth 4 bytes is the column count
            column_count = int.from_bytes(f.read(4), 'big')
            # rest is the image pixel data, each pixel is stored as an unsigned byte
            # pixel values are 0 to 255
            image_data = f.read()
            train_images = np.frombuffer(image_data, dtype=np.uint8)\
                .reshape((image_count, row_count, column_count))
            to_return.append(np.where(train_images > 127, 1, 0))

    if test_data:
        with gzip.open('data/mnist/t10k-images-idx3-ubyte.gz', 'r') as f:
            # first 4 bytes is a magic number
            magic_number = int.from_bytes(f.read(4), 'big')
            # second 4 bytes is the number of images
            image_count = int.from_bytes(f.read(4), 'big')
            # third 4 bytes is the row count
            row_count = int.from_bytes(f.read(4), 'big')
            # fourth 4 bytes is the column count
            column_count = int.from_bytes(f.read(4), 'big')
            # rest is the image pixel data, each pixel is stored as an unsigned byte
            # pixel values are 0 to 255
            image_data = f.read()
            test_images = np.frombuffer(image_data, dtype=np.uint8)\
                .reshape((image_count, row_count, column_count))
            to_return.append(np.where(test_images > 127, 1, 0))

    return to_return


def get_labels(train_data=True, test_data=False):

    to_return = []

    if train_data:
        with gzip.open('data/mnist/train-labels-idx1-ubyte.gz', 'r') as f:
            # first 4 bytes is a magic number
            magic_number = int.from_bytes(f.read(4), 'big')
            # second 4 bytes is the number of labels
            label_count = int.from_bytes(f.read(4), 'big')
            # rest is the label data, each label is stored as unsigned byte
            # label values are 0 to 9
            label_data = f.read()
            train_labels = np.frombuffer(label_data, dtype=np.uint8)
            to_return.append(train_labels)
    if test_data:
        with gzip.open('data/mnist/t10k-labels-idx1-ubyte.gz', 'r') as f:
            # first 4 bytes is a magic number
            magic_number = int.from_bytes(f.read(4), 'big')
            # second 4 bytes is the number of labels
            label_count = int.from_bytes(f.read(4), 'big')
            # rest is the label data, each label is stored as unsigned byte
            # label values are 0 to 9
            label_data = f.read()
            test_labels = np.frombuffer(label_data, dtype=np.uint8)
            to_return.append(test_labels)

    return to_return
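
The `np.where(... > 127, 1, 0)` call is what turns the 0 to 255 pixel values into 0 or 1. A small sketch of that step on a made-up 2×3 array; the comparison followed by `astype` is a suggested alternative, not part of the answer, which keeps the result at one byte per pixel instead of the platform's default integer type:

```python
import numpy as np

pixels = np.array([[0, 100, 127], [128, 200, 255]], dtype=np.uint8)

# As in the answer: values above 127 become 1, everything else 0.
binary = np.where(pixels > 127, 1, 0)

# Alternative: same values, stored as uint8.
binary_u8 = (pixels > 127).astype(np.uint8)
```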

Ciro Santilli OurBigBook.com · Answer 7 · 2023-04-13T07:12:39.290

Mass convert to PNG files

https://github.com/myleott/mnist_png/blob/400fe88faba05ae79bbc2107071144e6f1ea2720/convert_mnist_to_png.py contains a good PNG extraction example, licensed under GPL 2.0. Should be easy to adapt to other output formats with a library like Pillow.

They also have a pre-extracted archive at: https://github.com/myleott/mnist_png/blob/master/mnist_png.tar.gz?raw=true

Usage:

wget \
 http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz \
 http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz \
 http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz \
 http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
gunzip --keep *-ubyte.gz
python3 -m pip install pypng==0.20220715.0
./convert_mnist_to_png.py . out

And now out/ contains files such as:

out/training/0/1.png

out/training/0/21.png

out/training/1/3.png

out/training/1/6.png

out/testing/0/10.png

out/testing/0/13.png

convert_mnist_to_png.py

#!/usr/bin/env python

import os
import struct
import sys

from array import array
from os import path

import png

# source: http://abel.ee.ucla.edu/cvxopt/_downloads/mnist.py
def read(dataset = "training", path = "."):
    if dataset == "training":
        fname_img = os.path.join(path, 'train-images-idx3-ubyte')
        fname_lbl = os.path.join(path, 'train-labels-idx1-ubyte')
    elif dataset == "testing":
        fname_img = os.path.join(path, 't10k-images-idx3-ubyte')
        fname_lbl = os.path.join(path, 't10k-labels-idx1-ubyte')
    else:
        raise ValueError("dataset must be 'testing' or 'training'")

    flbl = open(fname_lbl, 'rb')
    magic_nr, size = struct.unpack(">II", flbl.read(8))
    lbl = array("b", flbl.read())
    flbl.close()

    fimg = open(fname_img, 'rb')
    magic_nr, size, rows, cols = struct.unpack(">IIII", fimg.read(16))
    img = array("B", fimg.read())
    fimg.close()

    return lbl, img, size, rows, cols

def write_dataset(labels, data, size, rows, cols, output_dir):
    # create output directories
    output_dirs = [
        path.join(output_dir, str(i))
        for i in range(10)
    ]
    for dir in output_dirs:
        if not path.exists(dir):
            os.makedirs(dir)

    # write data
    for (i, label) in enumerate(labels):
        output_filename = path.join(output_dirs[label], str(i) + ".png")
        print("writing " + output_filename)
        with open(output_filename, "wb") as h:
            w = png.Writer(cols, rows, greyscale=True)
            data_i = [
                data[ (i*rows*cols + j*cols) : (i*rows*cols + (j+1)*cols) ]
                for j in range(rows)
            ]
            w.write(h, data_i)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: {0} <input_path> <output_path>".format(sys.argv[0]))
        sys.exit()

    input_path = sys.argv[1]
    output_path = sys.argv[2]

    for dataset in ["training", "testing"]:
        labels, data, size, rows, cols = read(dataset, input_path)
        write_dataset(labels, data, size, rows, cols,
                      path.join(output_path, dataset))

Inspecting the generated PNGs with:

identify out/testing/0/10.png

gives:

out/testing/0/10.png PNG 28x28 28x28+0+0 8-bit Gray 256c 272B 0.000u 0:00.000

so they appear to be Grayscale and 8-bit, and therefore should faithfully represent the original data.

Tested on Ubuntu 22.10.

score -4 · Answer 8 · answered Jun 27 '20 at 12:26

I had the same issue.

When I unzipped the files, the .gz extension was not removed from the file name, so I had:

train-images-idx3-ubyte.gz

Once I removed the .gz, I had:

train-images-idx3-ubyte

This fixed my issue.

answered Jun 27 '20 at 12:26

Trey
