How to use datasets.fetch_mldata() in sklearn?

Question

I am trying to run the following code for a brief machine learning algorithm:

import re
import argparse
import csv
from collections import Counter
from sklearn import datasets
import sklearn
from sklearn.datasets import fetch_mldata

dataDict = datasets.fetch_mldata('MNIST Original')

In this piece of code, I am trying to read the dataset 'MNIST Original' from mldata.org via sklearn. This results in the following error (there are more lines of code, but the error occurs at this particular line):

Traceback (most recent call last):
  File "C:\Program Files (x86)\JetBrains\PyCharm 2.7.3\helpers\pydev\pydevd.py", line 1481, in <module>
    debugger.run(setup['file'], None, None)
  File "C:\Program Files (x86)\JetBrains\PyCharm 2.7.3\helpers\pydev\pydevd.py", line 1124, in run
    pydev_imports.execfile(file, globals, locals) #execute the script
  File "C:/Users/sony/PycharmProjects/Machine_Learning_Homework1/zeroR.py", line 131, in <module>
    dataDict = datasets.fetch_mldata('MNIST Original')
  File "C:\Anaconda\lib\site-packages\sklearn\datasets\mldata.py", line 157, in fetch_mldata
    matlab_dict = io.loadmat(matlab_file, struct_as_record=True)
  File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio.py", line 176, in loadmat
    matfile_dict = MR.get_variables(variable_names)
  File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio5.py", line 294, in get_variables
    res = self.read_var_array(hdr, process)
  File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio5.py", line 257, in read_var_array
    return self._matrix_reader.array_from_header(header, process)
  File "mio5_utils.pyx", line 624, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy\io\matlab\mio5_utils.c:5717)
  File "mio5_utils.pyx", line 653, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy\io\matlab\mio5_utils.c:5147)
  File "mio5_utils.pyx", line 721, in scipy.io.matlab.mio5_utils.VarReader5.read_real_complex (scipy\io\matlab\mio5_utils.c:6134)
  File "mio5_utils.pyx", line 424, in scipy.io.matlab.mio5_utils.VarReader5.read_numeric (scipy\io\matlab\mio5_utils.c:3704)
  File "mio5_utils.pyx", line 360, in scipy.io.matlab.mio5_utils.VarReader5.read_element (scipy\io\matlab\mio5_utils.c:3429)
  File "streams.pyx", line 181, in scipy.io.matlab.streams.FileStream.read_string (scipy\io\matlab\streams.c:2711)
IOError: could not read bytes

I have tried researching this on the internet, but there is hardly any help available. Any expert help in solving this error would be much appreciated.

TIA.

skovorodkin · Answer 1 · 2019-02-10T12:52:38.473

32

As of version 0.20, sklearn deprecates the fetch_mldata function and adds fetch_openml as its replacement.

Download the MNIST dataset with the following code:

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')

There are some changes to the format though. For instance, mnist['target'] is an array of string category labels (not floats as before).
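If downstream code expects numeric labels, the string labels can be converted back. The following is a minimal sketch, not part of the original answer: labels_to_int is a hypothetical helper name, and the short list stands in for mnist['target'] so the example runs without downloading anything.

```python
def labels_to_int(target):
    """Convert an iterable of digit strings (e.g. '5') to a list of ints."""
    return [int(label) for label in target]


# Stand-in for mnist['target']; the real labels come from fetch_openml('mnist_784').
sample_target = ['5', '0', '4', '1']
print(labels_to_int(sample_target))  # [5, 0, 4, 1]
```

The helper only assumes that the labels can be iterated over and that each one is a digit string.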

edited Feb 10 '19 at 12:52

answered Feb 10 '19 at 12:21

skovorodkin


score 10 · Answer 2 · answered Nov 04 '14 at 20:41

It looks like the cached data is corrupted. Try removing the file and downloading it again (it takes a moment). Unless you specified a different location, the data for 'MNIST original' should be in

~/scikit_learn_data/mldata/mnist-original.mat
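Removing the cached copy could be scripted roughly as follows. This is a sketch under assumptions: remove_cached_dataset is a hypothetical helper, and the default location is the path given above, so pass your own data_home if you set one.

```python
from pathlib import Path


def remove_cached_dataset(data_home="~/scikit_learn_data",
                          filename="mnist-original.mat"):
    """Delete <data_home>/mldata/<filename>; return True if a file was removed."""
    cached = Path(data_home).expanduser() / "mldata" / filename
    if cached.is_file():
        cached.unlink()
        return True
    return False


# Example call (commented out so that importing this snippet deletes nothing):
# remove_cached_dataset()
```

After the file is gone, the next fetch_mldata call should attempt a fresh download.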

answered Nov 04 '14 at 20:41

Szymon Laszczyński


Soundous Bahri · Answer 3 · 2018-12-18T15:55:38.337

5

I downloaded the dataset from this link

https://github.com/amplab/datascience-sp14/blob/master/lab7/mldata/mnist-original.mat

Then I ran these lines:

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original', transpose_data=True, data_home='files')

Note: the file must be located at (your working directory)/files/mldata/mnist-original.mat

I hope this helps; it worked well for me.
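Placing the manually downloaded file where fetch_mldata looks for it can be sketched like this. The helper name install_local_copy is made up for illustration; the 'files' directory and the mldata/mnist-original.mat layout are taken from the answer above.

```python
import shutil
from pathlib import Path


def install_local_copy(downloaded_file, data_home="files"):
    """Copy a downloaded .mat file to <data_home>/mldata/mnist-original.mat."""
    target_dir = Path(data_home) / "mldata"
    target_dir.mkdir(parents=True, exist_ok=True)  # create folders if missing
    target = target_dir / "mnist-original.mat"
    shutil.copyfile(str(downloaded_file), str(target))
    return target
```

Calling install_local_copy('mnist-original.mat') from the working directory and then passing data_home='files' to fetch_mldata should make it pick up the local copy.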

edited Dec 18 '18 at 15:55

answered Jul 27 '18 at 19:03

Soundous Bahri


The first time you run it, it will create an mldata folder. Paste the downloaded file into the mldata folder, then run the application again. It will now use the downloaded copy in your local directory rather than trying to download from the internet, which is also faster. – Wahome Jul 28 '21 at 04:34

score 1 · Answer 4 · answered Jan 13 '16 at 22:48

Here is some sample code showing how to get the MNIST data ready to use with sklearn:

def get_data():
    """
    Get MNIST data ready to learn with.

    Returns
    -------
    dict
        With keys 'train' and 'test'. Both have the keys 'X' (features)
        and 'y' (labels).
    """
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')

    x = mnist.data
    y = mnist.target

    # Scale data to [-1, 1] - This is of major importance!!!
    x = x/255.0*2 - 1

    # sklearn.cross_validation was removed in 0.20; use model_selection (0.18+)
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                        test_size=0.33,
                                                        random_state=42)
    data = {'train': {'X': x_train,
                      'y': y_train},
            'test': {'X': x_test,
                     'y': y_test}}
    return data

score 1 · Answer 5 · answered Jul 09 '17 at 22:58

I experienced the same issue and found that mnist-original.mat had a different file size at different times while I was on a poor WiFi connection. After I switched to a wired LAN it worked fine, so it may be a networking issue.

answered Jul 09 '17 at 22:58

YH Hsu


score 0 · Answer 6 · answered Mar 14 '14 at 21:44

Try it like this:

dataDict = fetch_mldata('MNIST original')

This worked for me. Since you used the from ... import ... syntax, you shouldn't prepend datasets when you call it.

answered Mar 14 '14 at 21:44

Brent


score 0 · Answer 7 · answered Jan 27 '16 at 21:24

I was also getting a fetch_mldata() "IOError: could not read bytes" error. Here is the solution; the relevant lines of code are

from sklearn.datasets.mldata import fetch_mldata
mnist = fetch_mldata('mnist-original', data_home='/media/Vancouver/apps/mnist_dataset/')

Be sure to change 'data_home' to your preferred location (directory).

Here is a script:

#!/usr/bin/python
# coding: utf-8

# Source:
# https://stackoverflow.com/questions/19530383/how-to-use-datasets-fetch-mldata-in-sklearn
# ... modified, below, by Victoria

"""
pers. comm. (Jan 27, 2016) from MLdata.org MNIST dataset contactee "Cheng Ong:"

    The MNIST data is called 'mnist-original'. The string you pass to sklearn
    has to match the name of the URL:

    from sklearn.datasets.mldata import fetch_mldata
    data = fetch_mldata('mnist-original')
"""

def get_data():

    """
    Get MNIST data; returns a dict with keys 'train' and 'test'.
    Both have the keys 'X' (features) and 'y' (labels)
    """

    from sklearn.datasets.mldata import fetch_mldata

    mnist = fetch_mldata('mnist-original', data_home='/media/Vancouver/apps/mnist_dataset/')

    x = mnist.data
    y = mnist.target

    # Scale data to [-1, 1]
    x = x/255.0*2 - 1

    # sklearn.cross_validation was removed in 0.20; use model_selection (0.18+)
    from sklearn.model_selection import train_test_split

    x_train, x_test, y_train, y_test = train_test_split(x, y,
        test_size=0.33, random_state=42)

    data = {'train': {'X': x_train, 'y': y_train},
            'test': {'X': x_test, 'y': y_test}}

    return data

data = get_data()
print '\n', data, '\n'

score 0 · Answer 8 · answered Apr 06 '17 at 08:15

If you don't specify data_home, the program looks for ${yourprojectpath}/mldata/mnist-original.mat. You can download the file and put it at that path.

answered Apr 06 '17 at 08:15

mcolak


score 0 · Answer 9 · answered Apr 14 '18 at 08:47

I also had this problem in the past. The dataset is quite large (about 55.4 MB). I ran fetch_mldata, but because of my internet connection it took a while to download everything. I did not realize this and interrupted the process.

The dataset was left corrupted, and that is why the error happened.
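One general way to avoid leaving a half-written file behind when a download is interrupted is to write to a temporary name and rename it only on success. The sketch below illustrates that pattern; it is not what fetch_mldata does internally, and save_atomically is a hypothetical helper that takes any iterable of byte chunks.

```python
import os


def save_atomically(chunks, dest):
    """Write byte chunks to dest via a temporary '.part' file.

    dest only appears once every chunk has been written, so an
    interrupted transfer never leaves a truncated file at dest.
    """
    tmp = str(dest) + ".part"
    try:
        with open(tmp, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        os.replace(tmp, str(dest))  # rename only after a complete write
    except BaseException:
        # Covers Ctrl+C as well as I/O errors: drop the partial file.
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

With this pattern, a later run either finds a complete file or no file at all, rather than a corrupted cache.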

score 0 · Answer 10 · answered Jul 25 '18 at 08:13

Apart from what @szymon has mentioned, you can alternatively load the dataset using:

from six.moves import urllib  # urllib that works on both Python 2 and 3
from scipy.io import loadmat  # reads the downloaded .mat file
mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
mnist_path = "./mnist-original.mat"
response = urllib.request.urlopen(mnist_alternative_url)
with open(mnist_path, "wb") as f:
    content = response.read()
    f.write(content)
mnist_raw = loadmat(mnist_path)
mnist = {
    "data": mnist_raw["data"].T,
    "target": mnist_raw["label"][0],
    "COL_NAMES": ["label", "data"],
    "DESCR": "mldata.org dataset: mnist-original",
}

score -1 · Answer 11 · answered Oct 23 '13 at 00:01

That's 'MNIST original', with a lowercase "o".

answered Oct 23 '13 at 00:01

Lucas Ribeiro


Hi, thanks for your reply. Tried with small 'o' as well, still the same error. – Patthebug Oct 23 '13 at 19:38

Using a lowercase or uppercase "o" does not make a difference. Internally, sklearn [makes everything lowercase](https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/datasets/mldata.py#L33): `dataname.lower().replace(' ', '-')`. – Ricardo Magalhães Cruz Mar 31 '17 at 09:49
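The effect of that expression can be checked in isolation. This sketch only reproduces the expression quoted in the comment above; mldata_style_name is a made-up name, and the real sklearn function may apply further clean-up.

```python
def mldata_style_name(dataname):
    """Lowercase the name and replace spaces with hyphens."""
    return dataname.lower().replace(' ', '-')


print(mldata_style_name('MNIST Original'))  # mnist-original
print(mldata_style_name('MNIST original'))  # mnist-original
```

Both spellings map to the same name, which is consistent with the error persisting after the capitalization was changed.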
