I have a big 3D numpy array that is created inside my program as the product of two other arrays (i.e., it is not built by parsing a file from disk). The array can have dimensions 250000-by-270-by-100 (or even bigger), so its memory footprint is around 50 GB, since 250000 * 270 * 100 * 8 / (1024**3) ≈ 50.3 GB.

The array is accessed multiple times until a chain has converged, and every iteration of the chain mutates the array.

Depending on the memory installed on the PC that runs the code, it may crash. I would definitely like to make it lighter in any case. Currently the quick fix is to change the datatype, e.g., from float64 to float32.
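
To make the savings concrete, a quick back-of-the-envelope check (the shape is the one quoted above; the allocation line is left commented out because it still needs ~25 GB, and np.zeros is just a placeholder for however the array is really built):

```python
import numpy as np

shape = (250000, 270, 100)

# float64 is 8 bytes per element, float32 is 4, so the dtype change halves it.
print(np.prod(shape) * np.dtype(np.float64).itemsize / 1024**3)   # ~50.3 GB
print(np.prod(shape) * np.dtype(np.float32).itemsize / 1024**3)   # ~25.1 GB

# Allocating in float32 from the start avoids the temporary doubling that a
# later arr.astype(np.float32) cast would cause (both copies exist briefly).
# arr = np.zeros(shape, dtype=np.float32)   # placeholder allocation
```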

Saving the array to disk and then mmap()-ing it might be an option, but I haven't tried it because I think it would make things a lot slower.

What other options do I have? How can I handle an array that big? (Maybe dask, but I don't really know it...)

Aenaon

1 Answer


dask is definitely a way to solve your problem; however, it requires you to modify your code quite a bit. A rough sketch of what that could look like is below.
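
The two source arrays and the broadcasted product in this sketch are only placeholders, since the question does not show how the array is actually built; treat it as an illustration of holding the 50 GB array lazily as many small chunks, not as a drop-in replacement:

```python
import dask.array as da

# Placeholder stand-ins for the two source arrays (the real construction
# in the question is not shown).
x = da.random.random((250000, 270), chunks=(2000, 270))
y = da.random.random((270, 100), chunks=(270, 100))

# Build the (250000, 270, 100) array lazily via broadcasting; it is held as
# many modest-sized chunks and never has to fit in RAM all at once.
big = (x[:, :, None] * y[None, :, :]).astype('float32')

# Work with it like a numpy array, calling .compute() (or .persist(),
# da.store, ...) only where a concrete result is needed.
total = big.sum().compute()
```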

There are two more ways that I think you can try:

  1. Use numpy.memmap to create an array mapped directly to a file (see the sketch after this list). However, it will be slow when you need to access the array a lot during the calculation, especially if you are running on an HDD rather than an SSD. Ref: Working with big data in python and numpy, not enough ram, how to save partial results on disc?

  2. Go to a cloud service like AWS or GCP and rent a server with a high RAM capacity. They usually bill per second, which means you only pay for 5 minutes of use if you run it for 5 minutes. So if you plan your calculation carefully, you only have to run it once, and it will only cost you a few dollars (for example, on AWS an r5.2xlarge costs $0.504 per hour for 8 cores and 64 GB of RAM). Most importantly, you do not have to modify your code.
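
For point 1, here is a minimal numpy.memmap sketch; the filename and the float32 dtype are just assumptions for illustration:

```python
import numpy as np

shape = (250000, 270, 100)   # shape quoted in the question
dtype = np.float32           # combine with the dtype change to halve the file size

# 'big_array.dat' is a placeholder filename; mode='w+' creates the backing file.
a = np.memmap('big_array.dat', dtype=dtype, mode='w+', shape=shape)

# It behaves like an ordinary ndarray: only the pages you actually touch are
# pulled into RAM, and writes go back to the file.
a[0] = 1.0                   # mutate one 270 x 100 slab in place
a.flush()                    # push dirty pages to disk

# In a later run, reopen the same file instead of recreating it:
# a = np.memmap('big_array.dat', dtype=dtype, mode='r+', shape=shape)
```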

Wilson.L