
I'm trying to run PCA on the MNIST data (just messing around with it, trying to learn some ML stuff), but I get a memory allocation error that seems far too small for my machine. I've tried two slightly different scripts; the following is copied from this website: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60 (where I managed to run PCA on the Iris dataset absolutely fine).

But when I go to run the following:

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')

from sklearn.model_selection import train_test_split
# test_size: what proportion of original data is used for test set
train_img, test_img, train_lbl, test_lbl = train_test_split( mnist.data, mnist.target, test_size=1/7.0, random_state=0)


from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(train_img)
# Apply transform to both the training set and the test set.
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)


from sklearn.decomposition import PCA
# Keep enough components to explain 95% of the variance
pca = PCA(.95)


pca.fit(train_img)

I get the error:

Traceback (most recent call last):
  File "C:\...\Python\pca_mnist_new.py", line 12, in <module>
    scaler.fit(train_img)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\preprocessing\_data.py", line 667, in fit
    return self.partial_fit(X, y)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\preprocessing\_data.py", line 762, in partial_fit
    _incremental_mean_and_var(X, self.mean_, self.var_,
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\utils\extmath.py", line 765, in _incremental_mean_and_var
    new_sum = _safe_accumulator_op(np.nansum, X, axis=0)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\utils\extmath.py", line 711, in _safe_accumulator_op
    result = op(x, *args, **kwargs)
  File "<__array_function__ internals>", line 5, in nansum
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\numpy\lib\nanfunctions.py", line 649, in nansum
    a, mask = _replace_nan(a, 0)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\numpy\lib\nanfunctions.py", line 109, in _replace_nan
    a = np.array(a, subok=True, copy=True)
MemoryError: Unable to allocate 359. MiB for an array with shape (60000, 784) and data type float64
[Finished in 29.868s]

(I get a similar error, with a slightly different traceback, when I run the code I wrote previously with the data already loaded:

Traceback (most recent call last):
  File "C:\...\Python\pca_MNIST.py", line 36, in <module>
    pca.fit(x)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\decomposition\_pca.py", line 351, in fit
    self._fit(X)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\decomposition\_pca.py", line 423, in _fit
    return self._fit_full(X, n_components)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\sklearn\decomposition\_pca.py", line 454, in _fit_full
    U, S, V = linalg.svd(X, full_matrices=False)
  File "C:\...\Local\Programs\Python\Python38-32\lib\site-packages\scipy\linalg\decomp_svd.py", line 128, in svd
    u, s, v, info = gesXd(a1, compute_uv=compute_uv, lwork=lwork,
MemoryError: Unable to allocate 359. MiB for an array with shape (60000, 784) and data type float64
[Finished in 2.792s]

but both have exactly the same error at the bottom.)

I'm on Windows 10, running this code in Atom, but I get the same error running it from the command line with everything else closed. I have 16 GB of RAM.
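
(One thing I can at least check, since the folder names in the traceback say Python38-32, is whether the interpreter itself is a 32-bit build; I'm assuming a quick check along these lines would tell me:)

import sys, platform
print(platform.architecture())  # e.g. ('32bit', 'WindowsPE') vs ('64bit', 'WindowsPE')
print(sys.maxsize > 2**32)      # True only on a 64-bit interpreter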

I understand that MiB is a mebibyte, and 359 of them seems far too small to trigger an allocation error on a machine with 16 GB of RAM, but this is where my limited expertise and frustrated googling leave me at a loss.
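
(If I've done the arithmetic right, the 359 MiB figure does at least line up with one full float64 copy of the training matrix:)

>>> 60000 * 784 * 8 / 2**20  # rows * columns * 8 bytes per float64, in MiB
358.88671875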

I see from this: https://stackoverflow.com/questions/44508254/increasing-memory-limit-in-python#:~:text=Python%20doesn't%20limit%20memory,what%20you're%20looking%20for., that Python simply allocates as much memory as it can until there is none left.

Is it possible that the PCA function is using all of this memory and this error is simply for the array that broke the camel's back? My intuition says no, but I'm really out of my depth at this point.

Any way to get this working so I can go play with some lower dimensional data? Or am I going to have to take a detour and write something to do this manually?

1 Answer


A simple workaround you should definitely try is reducing the floating point precision. float64 seems excessive here; even neural networks are usually trained in float32 or lower precision.

import numpy as np

train_img = train_img.astype(np.float32)  # or even np.float16

Try this for test_img too.
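
If the scaler or PCA still ends up working on a float64 copy, you can cast right after loading, before anything else touches the data. Here's a rough sketch of the same idea applied to the pipeline from the question (the only change from your code is the np.asarray cast up front, so treat it as a starting point rather than a guaranteed fix):

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

mnist = fetch_openml('mnist_784')
# Cast once, up front, so every later copy (split, scaling, PCA) is half the size
X = np.asarray(mnist.data, dtype=np.float32)

train_img, test_img, train_lbl, test_lbl = train_test_split(
    X, mnist.target, test_size=1/7.0, random_state=0)

scaler = StandardScaler()
scaler.fit(train_img)  # scikit-learn keeps float32 input as float32 here
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)

pca = PCA(.95)  # keep enough components to explain 95% of the variance
pca.fit(train_img)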

Nicolas Gervais
  • `MemoryError: Unable to allocate 359. MiB for an array with shape (60000, 784) and data type float64` I still get this error; I believe it is the PCA or StandardScaler function that is creating the float64 array – Sam Matthews Jun 21 '20 at 14:02