
I have a number of (4,) arrays that I want to save to disk (the data I am working with cannot fit into memory, so I need to dynamically load only what I need). However, I want to keep all of them in a single numpy.memmap. I am not sure whether this is possible, but any suggestion would be greatly appreciated.

This is what I have without numpy.memmap:

arr1 = [1,2,3,4]
arr2 = [2,3,4,5]
arr3 = [3,4,5,6]
arr4 = [4,5,6,7]
data = []
data.extend([arr1])
data.extend([arr2])
data.extend([arr3])
data.extend([arr4])
print(data)

[[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6], [4, 5, 6, 7]]

I want to be able to do something like this:

import numpy as np

arr1 = np.memmap('./file1', np.dtype('O'), mode='w+', shape=(4,))
arr1[:] = [1,2,3,4]
arr2 = np.memmap('./file2', np.dtype('O'), mode='w+', shape=(4,))
arr2[:] = [2,3,4,5]
arr3 = np.memmap('./file3', np.dtype('O'), mode='w+', shape=(4,))
arr3[:] = [3,4,5,6]
arr4 = np.memmap('./file4', np.dtype('O'), mode='w+', shape=(4,))
arr4[:] = [4,5,6,7]
data = []
data.extend([arr1])
data.extend([arr2])
data.extend([arr3])
data.extend([arr4])
print(data)

[memmap([1, 2, 3, 4], dtype=object), memmap([2, 3, 4, 5], dtype=object), memmap([3, 4, 5, 6], dtype=object), memmap([4, 5, 6, 7], dtype=object)]

This requires me to create a separate file per array, and I really want a single memmap that holds all of the mini-arrays of length 4. Can someone suggest a way to do this using memmaps?

The ability to extend, i.e. data.extend(), is important because I don't know in advance how many mini-arrays I will have.
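To make the target layout concrete, here is a rough sketch of what I am picturing (assuming every mini-array keeps its fixed length of 4 and a plain numeric dtype instead of 'O'; the file name data.bin is just a placeholder): append each mini-array to the end of one binary file as it arrives, then map the finished file as a single 2-D array.

import numpy as np

ROW_LEN = 4                 # every mini-array has length 4
DTYPE = np.int64            # assumption: a fixed numeric dtype, not 'O'
PATH = './data.bin'         # placeholder file name

# Append each mini-array to the end of one binary file as it arrives,
# so the number of rows does not need to be known up front.
n_rows = 0
with open(PATH, 'wb') as f:
    for mini in ([1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6], [4, 5, 6, 7]):
        np.asarray(mini, dtype=DTYPE).tofile(f)
        n_rows += 1

# Map the whole file as one (n_rows, 4) array; pages are read from disk
# on demand, so the full data set never has to sit in RAM at once.
data = np.memmap(PATH, dtype=DTYPE, mode='r', shape=(n_rows, ROW_LEN))
print(data[2])   # [3 4 5 6]

Is something along these lines the right direction, or is there a better pattern for this?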

  • Careful when creating `O` dtype `memmap`. This just saves pointers to objects to the file, not the actual objects. Those objects still exist only in your current memory. I can read your `arr1` in another session only if the values are integers less than 256. Anything else gives me a segmentation fault. – hpaulj Apr 15 '19 at 06:29
  • I don't think you can practically piece together memory-mapped chunks from different files into a consecutive range of memory. The underlying `mmap()` API allows hinting at your own addresses, but it would be a pain to piece together multiple files in a reliable and robust way. Also, the granularity would be the architecture's page size. You cannot piece together arbitrary sizes, only multiples of a page size, typically n * 4096 bytes. – blubberdiblub Apr 15 '19 at 07:23
  • @hpaulj Thanks for the tip. I actually have very large numbers. I have it as np.ndarray right now but I want to convert it to memmap somehow. Probably using np.dtype('O') is wrong. That's what I am trying to figure out. – Kour Apr 15 '19 at 08:12
  • @blubberdiblub, thanks for this insight; the mini-arrays don't need to be implemented as memmaps. I am really looking for a way to load these mini-arrays dynamically so that I don't have to load the entire list of arrays into memory, because I can't (even with 64GB of RAM). So I need a single `memmap` that holds these mini-arrays somehow. – Kour Apr 15 '19 at 08:14
  • @Kour I'm not aware of a numpy array type that acts as a kind of proxy to some piecewise loading/caching mechanism (which doesn't necessarily mean it doesn't exist). Maybe there's some way to solve the problem on the OS level? Maybe you can actually concatenate the arrays into one big file and afterwards (if necessary) split them again? Or maybe let the device mapper do it for you, if you're on a Linux system? – blubberdiblub Apr 15 '19 at 08:21
  • @blubberdiblub, so I already organize my data into a specific format and then use `numpy.save` to write it to disk. The `data` from the first code block gets wrapped in a numpy array, `data = np.array([data])`, and then saved the numpy way: `np.save('./filepath', data=data)`. I am working with Docker Linux images. To be honest, I am not strongly attached to memmap, but I want a way to load the data dynamically so that it fits in physical memory. I need to iterate over all of the items; I just didn't want to manage that myself and was hoping memmap would solve it for me. – Kour Apr 15 '19 at 08:38
  • @Kour well, you can do that with the available `memmap()` mechanism if you have all your data in one file. If that works for you, that should not be too hard. – blubberdiblub Apr 15 '19 at 08:43
  • @blubberdiblub Yeah, that would be awesome. I tried a simple `data = np.load('./filepath', mmap_mode='r')`, then took `data_r = data['data']` and iterated over the items with `for attr, _ in data_r.items()`, but that didn't fit in memory. It is possible that I am missing something, or maybe there is another way to deal with this. – Kour Apr 15 '19 at 08:46
  • @Kour don't load. Do a `np.memmap()` from the get-go. See the documentation for the `shape=` option. – blubberdiblub Apr 15 '19 at 08:48
  • @blubberdiblub, I see, this gives me a couple of ideas to try. Thank you, I will go and play with it a bit. Thanks for the input :) – Kour Apr 15 '19 at 08:55
  • `np.save(..., data=data)` doesn't look right. `np.savez(..., data=data)` makes more sense, esp. if you use `data['data']` in the load. But `mmap_mode` doesn't work with `npz` archives. Use `mmap_mode` when you save a large multidimensional numeric array. None of this fancy object or nested array stuff. – hpaulj Apr 15 '19 at 21:35
  • Have you tried loading the data into an SQL database or HDF5? Both are rather good options when it comes to handling data with known types. – Debanjan Basu Apr 16 '19 at 23:57

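A minimal sketch of the np.save / np.load(mmap_mode='r') route suggested in the comments above (this assumes the data is one homogeneous numeric 2-D array rather than dtype 'O'; data.npy is a placeholder name):

import numpy as np

# One homogeneous 2-D numeric array, saved as a plain .npy file
# (not an .npz archive, per the comment above about mmap_mode).
data = np.array([[1, 2, 3, 4],
                 [2, 3, 4, 5],
                 [3, 4, 5, 6],
                 [4, 5, 6, 7]], dtype=np.int64)
np.save('./data.npy', data)                    # placeholder file name

# Later, or in another process: map the file instead of loading it.
data_r = np.load('./data.npy', mmap_mode='r')  # returns a numpy.memmap
for row in data_r:                             # rows are paged in on demand
    print(row)

The .npy header stores the shape and dtype, so the later np.load call needs neither, and mmap_mode='r' hands back a memmap backed by the file instead of reading everything into RAM.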