MemoryError when generating numpy.MultivariateNormal

Question

I have a dataset with 707 columns and 3947 rows. From this, I calculate a 707x707 covariance matrix and an array of per-column means, using numpy.cov and pandas.DataFrame.mean respectively.

When I use this covariance matrix and set of means to try to generate a random multivariate normal toy dataset using numpy, I get back a MemoryError.

How could I generate a random dataset this large with these specifications without getting this error?

Edit:

Here's my stack trace:

Traceback (most recent call last):

  File "<ipython-input-28-701051dd6b16>", line 1, in <module>
    runfile('/project/home17/whb17/Documents/project2/scripts/mltest/covex.py', wdir='/project/home17/whb17/Documents/project2/scripts/mltest')

  File "/project/soft/linux64/anaconda/Anaconda3-5.0.1-Linux-x86_64/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "/project/soft/linux64/anaconda/Anaconda3-5.0.1-Linux-x86_64/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/project/home17/whb17/Documents/project2/scripts/mltest/covex.py", line 36, in <module>
    d2_x, d2_y = multivariate_normal(means, X_cov, [n_cols, n_rows], check_valid='ignore').T

  File "mtrand.pyx", line 4538, in mtrand.RandomState.multivariate_normal

MemoryError

Edit 2:

And here's the code that causes it:

X = pd.read_csv('../../data/mesa/MESA.csv', sep=',', header=None, index_col=0)
n_cols, n_rows = X.shape
means = X.mean(axis=0).tolist()
X_cov = np.cov(X.T)
d2_x, d2_y = multivariate_normal(means, X_cov, [n_cols, n_rows]).T

Also, are you sure the matrix shouldn't be 707 X 707? Usually columns are variables, and rows are instances. — Ami Tavory, May 18 '18 at 13:46
Thanks for pointing out that issue. I've edited the original post accordingly, but I still have the MemoryError. — Sartorible, May 18 '18 at 14:08
Could you write the exact command + stacktrace that you see? It's possible you're using it the wrong way. These sizes (including the original ones) are just not very large. — Ami Tavory, May 18 '18 at 14:09

Ami Tavory · Accepted Answer · 2018-05-18T14:55:23.663

From your code, it seems very likely that you've misinterpreted the use of multivariate_normal in

d2_x, d2_y = multivariate_normal(means, X_cov, [n_cols, n_rows]).T

The first and second parameters here are the means and the covariance. The third parameter (size) is the shape of an array of samples: every cell of that array holds one complete draw from the distribution, which is itself a vector with one entry per variable. The transpose of such an array cannot be unpacked into a pair, and it is almost certainly not what you want.

Just as an example, if the dimensions of X_cov are 707 X 707, then each draw is a vector of length 707, and the dimensions of the result are n_cols X n_rows X 707.
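To make the shape rule concrete, here is a small sketch with stand-in sizes (3 variables instead of 707, a 4 X 5 grid instead of n_cols X n_rows). The seed and the tiny dimensions are assumptions chosen only for illustration. The byte counts at the end assume float64 output and that X.shape in the question was (3947, 707); the exact figures would differ slightly if index_col=0 removed a column.

```python
import numpy as np

rng = np.random.RandomState(0)   # arbitrary seed, for reproducibility only

n_vars = 3                       # stand-in for the 707 variables
mean = np.zeros(n_vars)
cov = np.eye(n_vars)

# size given as a pair: one n_vars-long draw per cell of a 4 x 5 grid
grid = rng.multivariate_normal(mean, cov, [4, 5])
print(grid.shape)                # (4, 5, 3)

# size given as an int: 4 draws, one per row
flat = rng.multivariate_normal(mean, cov, 4)
print(flat.shape)                # (4, 3)

# Rough float64 footprint at the question's sizes (assumed, see above)
n_bytes_pair = 3947 * 707 * 707 * 8   # size=[3947, 707]
n_bytes_int = 3947 * 707 * 8          # size=3947
print(round(n_bytes_pair / 1e9, 1), "GB")   # 15.8 GB
print(round(n_bytes_int / 1e6, 1), "MB")    # 22.3 MB
```

Under those assumptions the pair form asks for roughly 16 GB, which would be enough to explain the MemoryError on a typical workstation.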

To generate a toy dataset, you should use

multivariate_normal(means, X_cov, n_rows)

The overall result, compared to your original question (before the first edit), should be about 1 / 1250000 of the size.
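A minimal end-to-end sketch of that approach follows. The random 50 X 4 array is a hypothetical stand-in for the contents of MESA.csv, and the seed is arbitrary. Note that .shape returns (rows, columns) in that order, so the sketch unpacks it as n_rows, n_cols.

```python
import numpy as np

rng = np.random.RandomState(0)     # arbitrary seed

# Hypothetical stand-in for the real data: 50 instances (rows), 4 variables (columns)
X = rng.normal(size=(50, 4))
n_rows, n_cols = X.shape           # shape is (rows, columns)

means = X.mean(axis=0)             # one mean per column
X_cov = np.cov(X, rowvar=False)    # columns are variables, so this is n_cols x n_cols

# One toy dataset with the same dimensions as the original
toy = rng.multivariate_normal(means, X_cov, n_rows)
print(X_cov.shape)                 # (4, 4)
print(toy.shape)                   # (50, 4)
```

np.cov(X, rowvar=False) gives the same matrix as np.cov(X.T) in the question; either spelling works.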

edited May 18 '18 at 14:55

answered May 18 '18 at 14:34

Thanks, but now I'm getting `TypeError: 'multivariate_normal_frozen' object is not iterable` – Sartorible May 18 '18 at 15:11
If you're trying to assign it to a pair, then, yeah - you'll get that error. Each time you call it, it will generate a single dataset with the dimensions of your original one. If you want two, I suggest you call it twice. – Ami Tavory May 18 '18 at 15:13
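As a sketch of the "call it twice" suggestion, using NumPy's sampler with small stand-in sizes (the seed, the 3 variables, and the 10 rows are assumptions for illustration). One further observation, offered as a guess because the import line is not shown: multivariate_normal_frozen is the type that scipy.stats.multivariate_normal returns when it is called, so the name may have been imported from SciPy rather than from numpy.random. In that case the third positional argument is not a sample size at all, and switching to the NumPy function would be needed.

```python
import numpy as np

rng = np.random.RandomState(0)   # arbitrary seed

# Stand-in inputs; in the question these come from the real data
means = np.zeros(3)
X_cov = np.eye(3)
n_rows = 10

# Two independent toy datasets, each n_rows x number-of-variables
d2_x = rng.multivariate_normal(means, X_cov, n_rows)
d2_y = rng.multivariate_normal(means, X_cov, n_rows)

print(d2_x.shape, d2_y.shape)    # (10, 3) (10, 3)
```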
