
I am new to Python and started using numpy. I am following an algorithm from a paper, and with my dataset it requires an array of dimension 1 million * 1 million.

The exact code is `larray(np.random.normal(0, sigma**2, size=(794832, 794832)))`

Although I have 16GB of RAM, numpy tries to load the whole array into memory at creation time, and hence I am getting a MemoryError. I tried lazy initialisation with lazyarray, but it is still not working.

Is there any way to create an array that uses the file system rather than RAM?

Thanks in advance.

Selva
    You need more than 4 TB of RAM. I don't think you have enough swap space. – Daniel Jun 04 '17 at 18:00
  • See https://stackoverflow.com/questions/1053928/very-large-matrices-using-python-and-numpy – clockwatcher Jun 04 '17 at 18:02
  • There's no way you're going to be able to store an array with a trillion elements on a consumer-class PC. Even if you used secondary memory you'd need a terabyte-class hard disk. What are you actually trying to do with the 10^6 x 10^6 array? This may be an XY problem – JacaByte Jun 04 '17 at 18:03
  • @clockwatcher Even if you stored a trillion element array on a hard disk you'd need terabytes of storage. Compression won't help either because it's random data. – JacaByte Jun 04 '17 at 18:06
  • Thanks all for your comments. I forgot to mention that I am new to the data science field and my data comes from a DB. Also, I made a mistake in the dimensions: it is `4000 * 794832`. Do I still need a terabyte of space? – Selva Jun 04 '17 at 18:10
  • @Daniel Thanks. Is it possible to get random.normal in PyTables? – Selva Jun 04 '17 at 18:19
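
(Side note, not from the comments above: if the matrix really has to live outside RAM, a disk-backed np.memmap is one way to get the "file system rather than RAM" behaviour the question asks about; PyTables/HDF5, mentioned in the comments, is another. A minimal sketch, assuming the corrected 4000 x 794832 shape, a placeholder sigma of 0.7, and an arbitrary file name and block size:)

import numpy as np

# Disk-backed array: the data lives in a file and is paged in on demand,
# so the full matrix never has to fit in RAM at once.
shape = (4000, 794832)
sigma = 0.7                      # placeholder value, not from the question
out = np.memmap("normal_matrix.dat", dtype=np.float32, mode="w+", shape=shape)

block = 500                      # rows generated per iteration; only this block is held in RAM
for start in range(0, shape[0], block):
    stop = min(start + block, shape[0])
    out[start:stop] = np.random.normal(0, sigma**2, size=(stop - start, shape[1]))

out.flush()                      # push everything to disk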

1 Answer


The size of the data you are creating depends on the matrix shape and on the precision (dtype) of its values.

You are using np.random.normal, which creates a matrix of float64 values. The 64 means that each number uses 64 bits, i.e. 8 bytes of memory (8 bits per byte). If your matrix has a shape of 4000x794832, that means you need about 23.7GB [4000*794832*8 bytes] of memory.

16GB of RAM is not enough for that, so either the machine will start swapping (if enough swap is defined) and creation will take a very long time, or it will simply run out of memory.

The question is: do you really need float64 precision? It is often more than usual scientific work requires. So, to reduce memory use and speed up any subsequent mathematical operations, you can consider changing the matrix precision to float16, for example [4000*794832*2 bytes, ~5.9GB].

import numpy as np
a = np.random.normal(0, 0.7**2, size=(4000, 794832))
a.nbytes   # 25434624000 bytes (~23.7GB), far more than 16GB of RAM
b = np.random.normal(0, 0.7**2, size=(4000, 794832)).astype(np.float16)
b.nbytes   # 6358656000 bytes (~5.9GB), big but it fits in RAM

The problem in this case is that np.random.normal has no option to specify the numpy dtype directly, so you first create a float64 matrix and then convert it, which is not very memory-efficient. But if you haven't got any other choice...
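
As a side note (not part of the original answer): on NumPy 1.17 and later, the Generator API can draw standard normals directly in float32, so the float64 intermediate is never allocated; and if float16 is enough, a chunked generate-and-cast loop keeps the peak memory small. A rough sketch, again assuming sigma = 0.7 as in the snippet above:

import numpy as np

# Option 1 (NumPy >= 1.17): draw directly in float32 with the Generator API.
rng = np.random.default_rng()
c = rng.standard_normal(size=(4000, 794832), dtype=np.float32)
c *= 0.7**2            # scale in place to avoid a second ~11.8GB copy
c.nbytes               # 12717312000 bytes (~11.8GB)

# Option 2: build the float16 matrix block by block, so only a small
# float64 chunk exists at any moment.
d = np.empty((4000, 794832), dtype=np.float16)   # ~5.9GB
block = 500
for start in range(0, 4000, block):
    d[start:start + block] = np.random.normal(0, 0.7**2, size=(block, 794832)).astype(np.float16)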

iblasi