128

I have a pkl file from the MNIST dataset, which consists of handwritten digit images.

I'd like to take a look at each of those digit images, so I need to unpack the pkl file, but I can't find out how.

Is there a way to unpack/unzip a pkl file?

martineau
ytrewq

4 Answers

238

Generally

Your pkl file is, in fact, a serialized pickle file, which means it has been dumped using Python's pickle module.

To un-pickle the data you can:

import pickle


with open('serialized.pkl', 'rb') as f:
    data = pickle.load(f)

For the MNIST data set

Note gzip is only needed if the file is compressed:

import gzip
import pickle


with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f)

Where each set can be further divided (i.e. for the training set):

train_x, train_y = train_set

Those would be the inputs (digits) and outputs (labels) of your sets.
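If you do not know in advance whether the file is compressed, the two cases above can be combined. The following is only a minimal sketch: the helper name `load_pickle` is hypothetical, and it assumes that a compressed file starts with the gzip magic bytes (0x1f 0x8b). The `encoding` argument is simply forwarded to `pickle.load`; pickles written by Python 2 (which is commonly reported for `mnist.pkl.gz`) may need `encoding='latin1'` under Python 3.

```python
import gzip
import pickle

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def load_pickle(path, encoding="ASCII"):
    """Unpickle `path`, using gzip only if the file looks compressed.

    Hypothetical helper, not part of any library. Only use it on files
    you trust, because unpickling can execute arbitrary code.
    """
    # Peek at the first two bytes to decide which opener to use.
    with open(path, "rb") as f:
        is_gzipped = f.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzipped else open
    with opener(path, "rb") as f:
        return pickle.load(f, encoding=encoding)
```

Usage would then look like `train_set, valid_set, test_set = load_pickle('mnist.pkl.gz', encoding='latin1')`, assuming the file has the three-set layout described above.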

If you want to display the digits:

import matplotlib.cm as cm
import matplotlib.pyplot as plt


plt.imshow(train_x[0].reshape((28, 28)), cmap=cm.Greys_r)
plt.show()

[Image: mnist_digit]

The other alternative would be to look at the original data:

http://yann.lecun.com/exdb/mnist/

But that will be harder, as you'll need to write a program to read the binary data in those files. So I recommend using Python and loading the data with pickle. As you've seen, it's very easy. ;-)

Peque
  • Is there also a way to make a pkl file out of the image files that I have? – ytrewq Aug 01 '14 at 16:38
  • Could be plain-old pickled, right? As opposed to cPickled? I'm not sure about the MNIST dataset, but for `pkl` files in general, `pickle.load` works for unpacking -- though I guess it performs less well than `cPickle.load`. For `pkl` files on the smaller side, the performance difference is not noticeable. – abcd Mar 06 '15 at 22:59
  • Also remember that the `open` function defaults to mode `r` (text read), so it is important to open the file in `rb` mode. If `b` (binary) is not added, unpickling might fail with a `UnicodeDecodeError`. – Tomasz Bartkowiak Jan 28 '20 at 10:00
  • People using the `pickle` module should keep in mind that [it is not secure](https://docs.python.org/3/library/pickle.html) and should only be used to unpickle data from trusted sources as there is the possibility for arbitrary code execution during the unpickling process. If you are producing pickles, consider signing data with [hmac](https://docs.python.org/3/library/hmac.html#module-hmac) to ensure data has not been tampered with, or using alternative forms of serialisation like [JSON](https://docs.python.org/3/library/pickle.html#comparison-with-json). – Kyle F Hartzenberg Apr 12 '23 at 04:25
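
The signing idea from the last comment could look roughly like this. It is a sketch under stated assumptions, not a vetted security design: the function names `dumps_signed` and `loads_signed` are hypothetical, the tag is an HMAC-SHA256 prepended to the pickle bytes, and the secret key is assumed to be shared safely between writer and reader.

```python
import hashlib
import hmac
import pickle

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def dumps_signed(obj, key):
    """Pickle `obj` and prepend an HMAC-SHA256 tag (hypothetical helper)."""
    payload = pickle.dumps(obj)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def loads_signed(blob, key):
    """Verify the tag first; unpickle only if it matches."""
    tag, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)
```

The point of the design is that `pickle.loads` is never reached for data whose tag does not verify, so a tampered file is rejected before any code in it can run.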
11

Handy one-liner

pkl() (
  python -c 'import pickle,sys;d=pickle.load(open(sys.argv[1],"rb"));print(d)' "$1"
)
pkl my.pkl

This prints the __str__ representation of the unpickled object.

The generic problem of visualizing an object is of course undefined, so if __str__ is not enough, you will need a custom script. @dataclass + pprint may be of interest; see: Is there a built-in function to print all the current properties and values of an object?
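
As a rough illustration of the @dataclass + pprint idea, here is a minimal sketch. The `Sample` class is a hypothetical stand-in for whatever object was pickled, and `describe` is an illustrative helper, not a library function; for non-dataclass objects it falls back to `vars()`, which only works for objects that have a `__dict__`.

```python
from dataclasses import asdict, dataclass, field, is_dataclass
from pprint import pformat


@dataclass
class Sample:  # hypothetical stand-in for the pickled object's class
    label: int
    pixels: list = field(default_factory=list)


def describe(obj):
    """Return a readable dump of an object's fields."""
    # Dataclasses convert cleanly to dicts; other objects expose __dict__.
    fields = asdict(obj) if is_dataclass(obj) else vars(obj)
    return pformat(fields, width=40)


print(describe(Sample(label=7, pixels=[0, 255])))
```

In a custom script you would call `describe` on the result of `pickle.load` instead of on a freshly built `Sample`.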

Mass direct extraction of MNIST -idx3-ubyte.gz files to PNG

You can also easily download the official dataset files from http://yann.lecun.com/exdb/mnist/ and expand them to PNGs using the script from: https://github.com/myleott/mnist_png

Related: How to put my dataset in a .pkl file in the exact format and data structure used in "mnist.pkl.gz"?

Ciro Santilli OurBigBook.com
2

In case you want to work with the original MNIST files, here is how you can deserialize them.

If you haven't downloaded the files yet, do that first by running the following in the terminal:

wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

Then save the following as deserialize.py and run it.

import numpy as np
import gzip

IMG_DIM = 28

def decode_image_file(fname):
    result = []
    n_bytes_per_img = IMG_DIM*IMG_DIM

    with gzip.open(fname, 'rb') as f:
        bytes_ = f.read()
        data = bytes_[16:]  # skip the 16-byte IDX header

        if len(data) % n_bytes_per_img != 0:
            raise Exception('Something wrong with the file')

        # Count images from the payload only, not the header-inclusive length
        result = np.frombuffer(data, dtype=np.uint8).reshape(
            len(data)//n_bytes_per_img, n_bytes_per_img)

    return result

def decode_label_file(fname):
    result = []

    with gzip.open(fname, 'rb') as f:
        bytes_ = f.read()
        data = bytes_[8:]

        result = np.frombuffer(data, dtype=np.uint8)

    return result

train_images = decode_image_file('train-images-idx3-ubyte.gz')
train_labels = decode_label_file('train-labels-idx1-ubyte.gz')

test_images = decode_image_file('t10k-images-idx3-ubyte.gz')
test_labels = decode_label_file('t10k-labels-idx1-ubyte.gz')

The script doesn't normalize the pixel values the way the pickled file does. To do that, divide by 255:

train_images = train_images/255
test_images = test_images/255
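
The offsets 16 and 8 used in the script are the header sizes of the IDX format: a 4-byte magic number (two zero bytes, a data-type code, and the number of dimensions) followed by one 4-byte big-endian integer per dimension. If you would rather read the sizes from the file than hard-code them, a minimal sketch could look like this. The function name `parse_idx_header` is hypothetical, and it only accepts the unsigned-byte type code (0x08) that the MNIST files use.

```python
import struct


def parse_idx_header(raw):
    """Return (dims, header_size) for the bytes of an IDX file.

    Hypothetical helper; assumes unsigned-byte data (type code 0x08).
    """
    # Magic number: two zero bytes, data-type code, number of dimensions.
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    header_size = 4 + 4 * ndim
    # Each dimension is stored as a 4-byte big-endian unsigned integer.
    dims = struct.unpack(">" + "I" * ndim, raw[4:header_size])
    return dims, header_size
```

For an image file this should yield three dimensions (count, rows, columns) and a header size of 16; for a label file, one dimension and a header size of 8, matching the slices in the script.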
osolmaz
2

The pickle module (and gzip, if the file is compressed) needs to be used.

NOTE: Both are already in the Python standard library, so there is no need to install anything new.

crabman84