
I have a bunch of large matrices (n x d, d << n) saved as binary files with numpy's memmap class on an AWS Elastic File System (EFS). I need to be able to quickly pick a matrix and randomly sample m rows (m << n) in my deployments on EC2, which have the EFS mounted. From my understanding of memmaps, this should only require loading those m rows into memory, so it should be much faster than reading the entire matrix into memory.
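Roughly the access pattern, as a minimal sketch (the path, dtype and shapes are placeholders, not the actual files):

```python
import numpy as np

# Placeholder path/dtype/shape; the real matrices live on the EFS mount.
EFS_PATH = "/mnt/efs/matrices/matrix_00.dat"
N, D = 1_000_000, 1024     # n x d, d << n
M = 5_000                  # rows to sample, m << n

# Opening the memmap reads nothing into RAM yet.
mat = np.memmap(EFS_PATH, dtype=np.float32, mode="r", shape=(N, D))

# Fancy-indexing m random rows should only fault in the pages backing those rows.
rows = np.random.default_rng().choice(N, size=M, replace=False)
sample = np.asarray(mat[rows])   # materialise the m x d sample in memory
```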

My main concern is that this operation takes much longer than I expected: for a matrix of shape (1M x 1024), it takes 30 seconds just to load 5,000 rows. I also found that sampling from a copy of the memmap file on local disk is much, much faster than reading a random sample of rows from the EFS mount.

Can someone explain why this is happening? I can see that it's not a problem with throughput from EFS: it only takes a few seconds to copy the entire memmap file to local disk (at which point I believe it's cached in RAM by the local OS of the deployment), and then less than a second to sample the rows I need. Why can't I sample directly from the mounted EFS? Could it be that, because it's an NFS mount, we're loading far more data than we actually need?
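For reference, this is the workaround I'm comparing against, sketched with the same placeholder names: one sequential copy off EFS, then sampling the local copy.

```python
import shutil
import numpy as np

# Same placeholder names as above; /tmp sits on the instance's local disk.
EFS_PATH = "/mnt/efs/matrices/matrix_00.dat"
LOCAL_PATH = "/tmp/matrix_00.dat"
N, D, M = 1_000_000, 1024, 5_000

# One sequential copy off EFS takes only a few seconds...
shutil.copyfile(EFS_PATH, LOCAL_PATH)

# ...and sampling the local copy (now largely in the OS page cache) takes under a second.
local = np.memmap(LOCAL_PATH, dtype=np.float32, mode="r", shape=(N, D))
rows = np.random.default_rng().choice(N, size=M, replace=False)
sample = np.asarray(local[rows])
```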

Edmond Lee
  • You are likely having an issue with high latency. This is an example using h5py (where you can speed up the operation using chunked storage): https://stackoverflow.com/a/44961222/4045774 It may make sense to use h5py here (with the right chunk shape and chunk cache). – max9111 Aug 09 '21 at 07:52
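A minimal sketch of what the suggested h5py approach could look like; the file name, dtype, chunk shape and cache size below are assumptions, not taken from the linked answer:

```python
import h5py
import numpy as np

N, D, M = 1_000_000, 1024, 5_000
H5_PATH = "/mnt/efs/matrices/matrix_00.h5"   # assumed path on the EFS mount

# One-time conversion: row-aligned chunks mean a random row read touches exactly one chunk.
with h5py.File(H5_PATH, "w") as f:
    dset = f.create_dataset("mat", shape=(N, D), dtype="float32",
                            chunks=(64, D))  # 64 * 1024 * 4 B = 256 KiB per chunk
    # ... then copy the rows in from the existing memmap, block by block (omitted)

# Sampling: a larger raw-chunk cache (rdcc_nbytes) keeps recently read chunks in RAM.
with h5py.File(H5_PATH, "r", rdcc_nbytes=256 * 1024**2) as f:
    rows = np.sort(np.random.default_rng().choice(N, size=M, replace=False))
    sample = f["mat"][rows]                  # h5py fancy indexing needs sorted, unique indices
```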

0 Answers