I have a big 3D numpy array that is created inside my program as the product of two other arrays (i.e., it is not built by parsing a file from disk). The array can have dimensions 250000-by-270-by-100 (or even bigger), so its memory footprint is around 50 GB, since 250000 * 270 * 100 * 8 / (1024**3) ≈ 50.3 GB.

The array is accessed multiple times until a chain has converged, and every iteration of the chain mutates the array.

Depending on the memory installed on the PC that runs the code, it may crash. I would definitely like to make it lighter in any case. Currently the quick fix is to change the datatype, e.g., from float64 to float32.
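
To make the savings concrete, a quick back-of-the-envelope check (the shape is the one quoted above; the allocation line is left commented out because it still needs ~25 GB, and np.zeros is just a placeholder for however the array is really built):

```python
import numpy as np

shape = (250000, 270, 100)

# float64 is 8 bytes per element, float32 is 4, so the dtype change halves it.
print(np.prod(shape) * np.dtype(np.float64).itemsize / 1024**3)   # ~50.3 GB
print(np.prod(shape) * np.dtype(np.float32).itemsize / 1024**3)   # ~25.1 GB

# Allocating in float32 from the start avoids the temporary doubling that a
# later arr.astype(np.float32) cast would cause (both copies exist briefly).
# arr = np.zeros(shape, dtype=np.float32)   # placeholder allocation
```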

Saving the array to disk and then mmap()-ing it might be an option, but I haven't tried it because I think it would make things a lot slower.

What other options do I have? How can I handle an array that big? (Maybe dask, but I don't really know it...)

Aenaon

1 Answer


dask is definitely a way to solve your problem; however, it requires you to modify your code quite a bit. A rough sketch of what that could look like is below.
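
The two source arrays and the broadcasted product in this sketch are only placeholders, since the question does not show how the array is actually built; treat it as an illustration of holding the 50 GB array lazily as many small chunks, not as a drop-in replacement:

```python
import dask.array as da

# Placeholder stand-ins for the two source arrays (the real construction
# in the question is not shown).
x = da.random.random((250000, 270), chunks=(2000, 270))
y = da.random.random((270, 100), chunks=(270, 100))

# Build the (250000, 270, 100) array lazily via broadcasting; it is held as
# many modest-sized chunks and never has to fit in RAM all at once.
big = (x[:, :, None] * y[None, :, :]).astype('float32')

# Work with it like a numpy array, calling .compute() (or .persist(),
# da.store, ...) only where a concrete result is needed.
total = big.sum().compute()
```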

There are two more ways that I think you can try:

  1. Use numpy.memmap to create an array mapped directly to a file (see the sketch after this list). However, it will be slow when you need to access the array a lot during the calculation, especially if you are running on an HDD rather than an SSD. Ref: Working with big data in python and numpy, not enough ram, how to save partial results on disc?

  2. Go to a cloud service like AWS or GCP and rent a server with a high RAM capacity. They usually bill per second, which means you only pay for 5 minutes of use if you run it for 5 minutes. So if you plan your calculation carefully, you only have to run it once, and it will only cost you a few dollars (for example, on AWS an r5.2xlarge costs $0.504 per hour for 8 cores and 64 GB of RAM). Most importantly, you do not have to modify your code.
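
For point 1, here is a minimal numpy.memmap sketch; the filename and the float32 dtype are just assumptions for illustration:

```python
import numpy as np

shape = (250000, 270, 100)   # shape quoted in the question
dtype = np.float32           # combine with the dtype change to halve the file size

# 'big_array.dat' is a placeholder filename; mode='w+' creates the backing file.
a = np.memmap('big_array.dat', dtype=dtype, mode='w+', shape=shape)

# It behaves like an ordinary ndarray: only the pages you actually touch are
# pulled into RAM, and writes go back to the file.
a[0] = 1.0                   # mutate one 270 x 100 slab in place
a.flush()                    # push dirty pages to disk

# In a later run, reopen the same file instead of recreating it:
# a = np.memmap('big_array.dat', dtype=dtype, mode='r+', shape=shape)
```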

Wilson.L