
I need to read in parts of a huge numpy array stored in a memory mapped file, process the data and repeat for another part of the array. The whole numpy array takes up around 50 GB and my machine has 8 GB of RAM.

I initially created the memory mapped file using numpy.memmap by reading in a lot of smaller files and processing their data and then writing the processed data to the memmap file. During the creation of the memmap file, I had no memory issues (I was using memmap.flush() periodically). Here's how I create the memory mapped file:

import numpy as np

# No dtype is passed here, so np.memmap uses its default (uint8)
mmapData = np.memmap(mmapFile, mode='w+', shape=(large_no1, large_no2))
for i1 in range(numFiles):
    auxData = load_data_from(file[i1])
    mmapData[i1, :] = auxData
    if i1 % 10 == 9:
        mmapData.flush()  # flush every 10 iterations or so

However, when I try to access small portions (<10 MB) of the memmap file, it floods my whole RAM as soon as the memmap object is created. The machine slows down drastically and I can't do anything. Here's how I try to read in the data from the memory mapped file:

mmapData = np.memmap(mmapFile, mode='r', shape=(large_no1, large_no2))
aux1 = mmapData[5, 1:10**7]  # slice bounds must be integers (1e7 is a float)

I thought using mmap or numpy.memmap should allow me to access parts of massive arrays without trying to load the whole thing to memory. What am I missing?

Am I using the wrong tool to access parts of a large numpy array (> 20 GB) stored on disk?
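For context, here is a minimal sketch of the chunked read pattern described above, run against a small stand-in file rather than the 50 GB array. The file name, the sizes, the float32 dtype and the `row_chunks` helper are all assumptions made for illustration; the one detail taken as given is that the dtype and shape passed when reading need to match what was written.

```python
import os
import tempfile
import numpy as np

# Hypothetical small stand-in for the real file; sizes are illustrative only.
n_rows, n_cols = 8, 100_000
path = os.path.join(tempfile.mkdtemp(), "demo.dat")

# Write phase: pass dtype explicitly (np.memmap defaults to uint8 otherwise).
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_rows, n_cols))
for i in range(n_rows):
    writer[i, :] = i          # stands in for load_data_from(file[i])
writer.flush()
del writer                    # drop the write mapping

# Read phase: dtype and shape must match what was written.
reader = np.memmap(path, dtype=np.float32, mode="r", shape=(n_rows, n_cols))

def row_chunks(mm, row, chunk):
    """Yield in-memory copies of consecutive slices of one row (hypothetical helper)."""
    for start in range(0, mm.shape[1], chunk):
        # np.array(...) copies only this slice out of the mapping.
        yield np.array(mm[row, start:start + chunk])

total = 0.0
for block in row_chunks(reader, 5, 25_000):
    total += float(block.sum())
```

Each yielded block is an ordinary in-memory array, so only one chunk at a time is copied out of the mapping. Whether this avoids the slowdown reported above is not established here.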

KartMan
  • I haven't reproduced it yet, but this seems surprising to me too. I thought slices were just fat pointers... Are you sure execution isn't going past the aux1 assignment? If it is going past the assignment, and you're reading that, more memory would be loaded than you'd expect due to cache lines being bigger than 5 bytes (usually 64 bytes), depending on whether you are using order='C' or 'F'. But that still shouldn't be enough to hose a machine with 8GB available RAM, so this isn't an answer. – Andrew Wagner Feb 22 '16 at 20:13

1 Answer


Could it be that you're looking at virtual, rather than physical memory consumption, and the slowdown is coming from something else?
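One way to check this on Linux is to compare the process's virtual size with its resident set size before and after the memmap is created. The sketch below is only an illustration under stated assumptions: it relies on `/proc/self/status` being available (Linux only), and the `mem_kib` helper and the sparse 256 MiB test file are hypothetical stand-ins, not part of the original setup.

```python
import os
import tempfile
import numpy as np

def mem_kib(field):
    """Return a field such as 'VmSize' or 'VmRSS' from /proc/self/status, in KiB.

    Linux-only assumption; returns None where /proc is unavailable.
    """
    try:
        with open("/proc/self/status") as fh:
            for line in fh:
                if line.startswith(field + ":"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None

# Hypothetical 256 MiB file; truncate() extends it without writing data,
# so on most filesystems it stays sparse and uses almost no disk space.
size = 1 << 28
path = os.path.join(tempfile.mkdtemp(), "sparse.dat")
with open(path, "wb") as fh:
    fh.truncate(size)

before = (mem_kib("VmSize"), mem_kib("VmRSS"))
mm = np.memmap(path, dtype=np.uint8, mode="r", shape=(size,))
small = np.array(mm[:1024])      # touch only the first 1 KiB
after = (mem_kib("VmSize"), mem_kib("VmRSS"))

print("virtual KiB before/after: ", before[0], after[0])
print("resident KiB before/after:", before[1], after[1])
```

If the virtual figure jumps by roughly the size of the file while the resident figure barely moves, the mapping itself is not what is filling physical RAM, and the slowdown would have to come from elsewhere.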

Andrew Wagner