
I have a dataset with 707 columns and 3947 rows. From this, I calculate a 707x707 covariance matrix and an array of column means, using numpy.cov and pandas.DataFrame.mean respectively.

When I use this covariance matrix and set of means to try to generate a random multivariate normal toy dataset using numpy, I get back a MemoryError.

How could I generate a random dataset this large with these specifications without getting this error?

Edit:

Here's my stack trace:

Traceback (most recent call last):

  File "<ipython-input-28-701051dd6b16>", line 1, in <module>
    runfile('/project/home17/whb17/Documents/project2/scripts/mltest/covex.py', wdir='/project/home17/whb17/Documents/project2/scripts/mltest')

  File "/project/soft/linux64/anaconda/Anaconda3-5.0.1-Linux-x86_64/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "/project/soft/linux64/anaconda/Anaconda3-5.0.1-Linux-x86_64/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/project/home17/whb17/Documents/project2/scripts/mltest/covex.py", line 36, in <module>
    d2_x, d2_y = multivariate_normal(means, X_cov, [n_cols, n_rows], check_valid='ignore').T

  File "mtrand.pyx", line 4538, in mtrand.RandomState.multivariate_normal

MemoryError

Edit 2:

And here's the code that causes it:

import numpy as np
import pandas as pd
from numpy.random import multivariate_normal

X = pd.read_csv('../../data/mesa/MESA.csv', sep=',', header=None, index_col=0)

n_cols, n_rows = X.shape

means = X.mean(axis=0).tolist()

X_cov = np.cov(X.T)

d2_x, d2_y = multivariate_normal(means, X_cov, [n_cols, n_rows]).T
Sartorible

1 Answer

From your code, it seems very likely that you've misinterpreted the use of multivariate_normal in

d2_x, d2_y = multivariate_normal(means, X_cov, [n_cols, n_rows]).T

The first and second parameters here are the means and covariance. The third parameter, size, is the shape of an array of samples: every cell of that array holds one independent draw of the random vector. Passing [n_cols, n_rows] therefore asks for a whole grid of draws. The transpose of that result is not a pair, and it is almost certainly not what you want.

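Just as an example, if the dimensions of X_cov are 707 X 707, then the result has dimensions n_cols X n_rows X 707: one 707-element sample for every cell of an n_cols X n_rows grid. With 3947 and 707 as the two grid dimensions, that is about 2 billion floats, or roughly 16 GB.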
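To see how the third argument drives the output shape, here is a minimal sketch with a small stand-in problem. The 3-variable mean and identity covariance are made up for illustration and are not the question's data; the last lines only do the size arithmetic for a 3947 X 707 grid and allocate nothing.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical stand-in: 3 variables instead of 707.
n_vars = 3
means = np.zeros(n_vars)
cov = np.eye(n_vars)

# An integer size gives one row per sample: shape (size, n_vars).
samples = rng.multivariate_normal(means, cov, 5)
print(samples.shape)

# A two-element size gives a grid of samples: shape (4, 5, n_vars).
grid = rng.multivariate_normal(means, cov, [4, 5])
print(grid.shape)

# Arithmetic only: elements and float64 gigabytes for a
# 3947 x 707 grid of 707-element samples.
n_elements = 3947 * 707 * 707
gigabytes = n_elements * 8 / 1e9
print(n_elements, gigabytes)
```

The trailing axis always has the length of the mean vector, so a two-element size multiplies the memory needed by one extra dimension of the dataset.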

To generate a toy dataset, you should use

multivariate_normal(means, X_cov, n_rows)

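The result is then a single n_rows X 707 array, smaller than what your original call tried to allocate by a factor of n_cols. Note also that X.shape is (rows, columns), so n_cols, n_rows = X.shape assigns the row count to n_cols and the column count to n_rows. Make sure the third argument is the number of samples you actually want.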
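Putting that together, a toy dataset with the same shape as the original could be generated roughly as follows. This is only a sketch: the random 200 X 6 array is a hypothetical stand-in for the data read from MESA.csv, and the names n_samples, n_features and toy are mine, not from the question.

```python
import numpy as np

rng = np.random.RandomState(42)

# Hypothetical stand-in for the real data: 200 rows (samples), 6 columns (variables).
X = rng.normal(size=(200, 6))

# shape is (rows, columns), in that order.
n_samples, n_features = X.shape

means = X.mean(axis=0)   # one mean per column, length n_features
X_cov = np.cov(X.T)      # rows of X.T are variables -> (n_features, n_features)

# One call, integer size: a single toy dataset shaped like X.
toy = rng.multivariate_normal(means, X_cov, n_samples)
print(toy.shape)

# A second, independent toy dataset is just a second call.
toy2 = rng.multivariate_normal(means, X_cov, n_samples)
```

With a pandas DataFrame in place of the array, X.mean(axis=0) and np.cov(X.T) should behave the same way, since both work column by column.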

Ami Tavory
  • Thanks, but now I'm getting `TypeError: 'multivariate_normal_frozen' object is not iterable` – Sartorible May 18 '18 at 15:11
  • If you're trying to assign it to a pair, then, yeah - you'll get that error. Each time you call it, it will generate a single dataset with the dimensions of your original one. If you want two, I suggest you call it twice. – Ami Tavory May 18 '18 at 15:13