I have a dataset with 707 columns and 3947 rows. From this, I calculate a 707x707 covariance matrix and an array of 707 column means, using numpy.cov and pandas.DataFrame.mean respectively.
When I pass this covariance matrix and set of means to numpy.random.multivariate_normal to generate a random multivariate normal toy dataset, I get back a MemoryError.
How could I generate a random dataset this large with these specifications without getting this error?
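For context, the approach itself seems fine at smaller scale; here is a minimal sketch of what I am trying to do, with made-up dimensions (5 variables, 100 samples) standing in for 707 and 3947:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_samples = 5, 100              # small stand-ins for 707 and 3947
data = rng.normal(size=(n_samples, dim))

means = data.mean(axis=0)            # per-column means
cov = np.cov(data.T)                 # dim x dim covariance matrix

# One draw per sample: result has shape (n_samples, dim)
toy = rng.multivariate_normal(means, cov, size=n_samples)
print(toy.shape)                     # (100, 5)
```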
Edit:
Here's my stack trace:
Traceback (most recent call last):
File "<ipython-input-28-701051dd6b16>", line 1, in <module>
runfile('/project/home17/whb17/Documents/project2/scripts/mltest/covex.py', wdir='/project/home17/whb17/Documents/project2/scripts/mltest')
File "/project/soft/linux64/anaconda/Anaconda3-5.0.1-Linux-x86_64/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "/project/soft/linux64/anaconda/Anaconda3-5.0.1-Linux-x86_64/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/project/home17/whb17/Documents/project2/scripts/mltest/covex.py", line 36, in <module>
d2_x, d2_y = multivariate_normal(means, X_cov, [n_cols, n_rows], check_valid='ignore').T
File "mtrand.pyx", line 4538, in mtrand.RandomState.multivariate_normal
MemoryError
Edit 2:
And here's the code that causes it:
import numpy as np
import pandas as pd
from numpy.random import multivariate_normal

X = pd.read_csv('../../data/mesa/MESA.csv', sep=',', header=None, index_col=0)
n_cols, n_rows = X.shape
means = X.mean(axis=0).tolist()
X_cov = np.cov(X.T)
d2_x, d2_y = multivariate_normal(means, X_cov, [n_cols, n_rows]).T
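For what it's worth, a back-of-envelope check (using the 707/3947 figures above) of how much memory that size=[n_cols, n_rows] call asks multivariate_normal to allocate:

```python
# size=[n_cols, n_rows] requests n_cols * n_rows draws from the
# distribution, each a 707-dimensional vector of float64 values,
# so the output array has shape (n_cols, n_rows, 707).
n_rows, n_cols = 3947, 707                 # shape of the original data
dim = 707                                  # dimension of the distribution
total_bytes = n_cols * n_rows * dim * 8    # float64 = 8 bytes
print(total_bytes / 1024**3)               # roughly 14.7 GiB
```

which I suspect explains the MemoryError on a typical machine; a single draw per row (size=n_rows) would be about 707 times smaller.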