
Is it possible to load a numpy.memmap without knowing the shape and still recover the shape of the data?

data = np.arange(12, dtype='float32')
data.resize((3,4))
fp = np.memmap(filename, dtype='float32', mode='w+', shape=(3,4))
fp[:] = data[:]
del fp
newfp = np.memmap(filename, dtype='float32', mode='r', shape=(3,4))

In the last line, I want to be able to omit the shape and still have the variable newfp come back with shape (3,4), just like it would with joblib.load. Is this possible? Thanks.

Michael

3 Answers


Not unless that information has been explicitly stored in the file somewhere. As far as np.memmap is concerned, the file is just a flat buffer.

I would recommend using np.save to persist numpy arrays, since this also preserves the metadata specifying their dimensions, dtypes etc. You can also load an .npy file as a memmap by passing the mmap_mode= parameter to np.load.
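For example, a quick round trip through `np.save` and `np.load` recovers the shape without it ever being passed explicitly (a sketch; the temporary file path is my own choice):

```python
import os
import tempfile
import numpy as np

# Save an array to .npy; the file header records its shape and dtype
data = np.arange(12, dtype='float32').reshape(3, 4)
path = os.path.join(tempfile.mkdtemp(), 'data.npy')
np.save(path, data)

# Load it back as a memory-mapped array -- no shape argument needed
mm = np.load(path, mmap_mode='r')
print(mm.shape)  # -> (3, 4)
```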

joblib.dump uses a combination of pickling to store generic Python objects and np.save to store numpy arrays.


To initialize an empty memory-mapped array backed by a .npy file you can use numpy.lib.format.open_memmap:

import numpy as np
from numpy.lib.format import open_memmap

# initialize an empty 10TB memory-mapped array
x = open_memmap('/tmp/bigarray.npy', mode='w+', dtype=np.ubyte, shape=(10**13,))

You might be surprised by the fact that this succeeds even if the array is larger than the total available disk space (my laptop only has a 500GB SSD, but I just created a 10TB memmap). This is possible because the file that's created is sparse.
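Reopening such a file later recovers the shape from the `.npy` header, so no shape argument is needed. A small sketch with a temporary file (using a modest size rather than 10TB, to keep it self-contained):

```python
import os
import tempfile
import numpy as np
from numpy.lib.format import open_memmap

path = os.path.join(tempfile.mkdtemp(), 'array.npy')

# Create and fill a memory-mapped .npy file
x = open_memmap(path, mode='w+', dtype='float32', shape=(3, 4))
x[:] = np.arange(12, dtype='float32').reshape(3, 4)
x.flush()
del x

# Reopen without specifying the shape -- it is read from the header
y = open_memmap(path, mode='r')
print(y.shape)  # -> (3, 4)
```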

Credit for discovering open_memmap should go to kiyo's previous answer here.

ali_m
  • The thing is I am dealing with very big data and `memmap` avoids filling all the RAM. I am also using it with `joblib.Parallel` to write to disk in parallel. – Michael Apr 20 '16 at 17:09
  • As I said, you can also open an `.npy` file as a memory-mapped array by passing the `mmap_mode=` parameter to `np.load`. Another option would be to use a combination of `joblib.dump` and `joblib.load` with the `mmap_mode=` parameter, which uses `np.save` and `np.load` under the hood. – ali_m Apr 20 '16 at 17:12
  • Suppose I need to initialize 100 GB of data, and I only have 32 GB of RAM. In that situation, I am forced to use `memmap` in write mode. Now, `np.load` doesn't work in such a case: I have to read it again using `memmap` in read mode. The question is, how to do this without knowing the shape of the data, and still get the shapes right. – Michael Apr 20 '16 at 17:51
  • I see - you should include this information in your question. In that case you can use `numpy.lib.format.open_memmap` to initialize an empty memory-mapped array backed by a `.npy` file (see [this previous answer](http://stackoverflow.com/q/4335289/1461210)). – ali_m Apr 20 '16 at 18:21
  • @ali_m So why `np.memmap` exists? I still can't understand why one should prefer `np.memmap` over `np.load`. – ado sar May 23 '23 at 13:21
  • One difference is that `np.load` can only be used to load an existing file, whereas `np.memmap` (and `open_memmap`) can create new files. `np.memmap` is more low-level, in that it just maps a flat buffer into memory. `np.load` and `open_memmap` operate on `.npy` files that have headers describing the shape and dtype of the elements, which may or may not be necessary depending on your use-case. – ali_m May 26 '23 at 10:31

The answer from @ali_m is perfectly valid. I would like to mention my personal preference, in case it helps anyone: I always store the shape in the first 2 elements of my memmap arrays. Doing this is as simple as:

# Writing the memmap array (data is the (3, 4) array from the question)
fp = np.memmap(filename, dtype='float32', mode='w+', shape=(3,4))
fp[:] = data[:]
# Reopen as a flat buffer; mode 'r+' grows the file by the two extra elements
fp = np.memmap(filename, dtype='float32', mode='r+', shape=(14,))
fp[2:] = fp[:-2]  # shift the data right to make room for the header
fp[:2] = [3, 4]   # store the shape in the first two elements
del fp

Or simpler still:

# Writing the memmap array in one pass: two header elements, then the data
fp = np.memmap(filename, dtype='float32', mode='w+', shape=(14,))
fp[2:] = data.ravel()  # flatten data so it fits the 1-D buffer
fp[:2] = [3, 4]
del fp

Then you can easily read the array as:

# Reading the memmap array
newfp = np.memmap(filename, dtype='float32', mode='r')
row_size, col_size = newfp[0:2]
# The header values are float32, so cast them before reshaping
newfp = newfp[2:].reshape((int(row_size), int(col_size)))
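The same trick can be generalized to arrays of any rank by also storing the number of dimensions in the header. A hypothetical sketch (the helper names `write_prefixed` and `read_prefixed` are my own, not from the answer above):

```python
import os
import tempfile
import numpy as np

def write_prefixed(filename, data):
    # Header: [ndim, dim_0, dim_1, ...] stored in the same float32 dtype
    header = np.array([data.ndim] + list(data.shape), dtype='float32')
    fp = np.memmap(filename, dtype='float32', mode='w+',
                   shape=(header.size + data.size,))
    fp[:header.size] = header
    fp[header.size:] = data.ravel()
    fp.flush()
    del fp

def read_prefixed(filename):
    # Open as a flat buffer, then recover the shape from the header
    flat = np.memmap(filename, dtype='float32', mode='r')
    ndim = int(flat[0])
    shape = tuple(int(s) for s in flat[1:1 + ndim])
    return flat[1 + ndim:].reshape(shape)

path = os.path.join(tempfile.mkdtemp(), 'prefixed.dat')
a = np.arange(12, dtype='float32').reshape(3, 4)
write_prefixed(path, a)
b = read_prefixed(path)
print(b.shape)  # -> (3, 4)
```

As ali_m notes in the comments, this only works for a fixed dtype, and the dimensions are stored as floats; for anything beyond a quick hack, the `.npy` header does this bookkeeping for you.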
Rahul Murmuria
  • That's fine so long as you only ever use 2D arrays with a fixed dtype (also you really ought to be storing array dimensions as integers rather than floats). The main advantage to using `np.save` or `numpy.lib.format.open_memmap` is that these automatically store metadata specifying the shape and the dtype of the array. – ali_m Jul 29 '16 at 10:15

An alternative to numpy.memmap is tifffile.memmap, which stores the shape in the TIFF metadata so it can be recovered when reading:

from tifffile import memmap

newArray = memmap("name", shape=(3, 3), dtype='uint8')
newArray[1, 1] = 11
del newArray

The file is created with the values:

0  0  0
0  11 0
0  0  0  

Now let's read it back:

array = memmap("name", dtype='uint8')
print(array.shape) # prints (3, 3)
print(array)

prints:

0  0  0
0  11 0
0  0  0
mercury0114