20

I saved a couple of numpy arrays with np.save(), and put together they're quite huge.

Is it possible to load them all as memory-mapped files, and then concatenate and slice through all of them without ever loading anything into memory?

Saullo G. P. Castro
vedran
  • Possible duplicate: [Concatenate Numpy arrays without copying](http://stackoverflow.com/questions/7869095/concatenate-numpy-arrays-without-copying) – Jonas Schäfer Dec 08 '12 at 19:17
  • Of course, I've tried simply np.concatenate() on a tuple of memory-mapped arrays, and the result gets loaded into memory, quickly crippling my system. – vedran Dec 08 '12 at 23:53
  • Reading the other thread, what you want to achieve seems quite impossible to me. Although I can really see the use. If it's just about slicing, I have one or two ideas, but these won't work with other numpy utils. – Jonas Schäfer Dec 08 '12 at 23:54
  • I guess I'll just have to live without slicing in this particular case, but you're free to share the ideas you have, of course. – vedran Dec 08 '12 at 23:58
  • Is [h5py](http://code.google.com/p/h5py/wiki/HowTo) a possibility for you? There, you can slice nicely without loading the whole thing. – cronos Dec 10 '12 at 17:21

3 Answers

24

Using numpy.concatenate apparently loads the arrays into memory. To avoid this you can easily create a third memmap array in a new file and read into it the values from the arrays you wish to concatenate (a sketch of this variant appears after example 1 below). More efficiently, you can also append new arrays to an already existing file on disk.

In either case you must choose the right order for the array (row-major or column-major).

The following examples illustrate how to concatenate along axis 0 and axis 1.


1) concatenate along axis=0

import numpy as np

a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000, 1000))   # 38.1 MB
a[:, :] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(15000, 1000))  # 114 MB
b[:, :] = 222

You can define a third array reading the same file as the first array to be concatenated (here a), in mode r+ (read and write on the existing file; numpy grows the file on disk when the requested shape is larger than its current size), but with the shape of the final array you want to achieve after concatenation, like:

c = np.memmap('a.array', dtype='float64', mode='r+', shape=(20000, 1000), order='C')
c[5000:, :] = b  # rows 0-4999 already hold a's data

Concatenating along axis=0 does not require passing order='C', because that is already the default order.
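As mentioned at the top, the other option is to write a separate third file and leave 'a.array' untouched. Here is a minimal sketch reusing the a and b defined above; the file name 'c2.array' and the variable c2 are just for illustration:

c2 = np.memmap('c2.array', dtype='float64', mode='w+', shape=(20000, 1000))
c2[:5000, :] = a   # block copies go through the mapping straight to disk
c2[5000:, :] = b
c2.flush()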


2) concatenate along axis=1

a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000, 3000))  # 114 MB
a[:, :] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(5000, 1000))  # 38.1 MB
b[:, :] = 222

The arrays saved on disk are actually flattened, so if you create c with mode='r+' and shape=(5000,4000) without changing the array order, the first 1000 elements of the second row of a will end up in the first row of c. You can easily avoid this by passing order='F' (column-major) to memmap:

c = np.memmap('a.array', dtype='float64', mode='r+', shape=(5000, 4000), order='F')
c[:, 3000:] = b

Now the updated file 'a.array' holds the concatenation result. You can repeat this process to concatenate arrays pairwise.
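Keep in mind that a raw memmap file has no header, so in a later session you have to supply the same dtype, shape, and order yourself when reopening it. A minimal reload sketch following this answer's convention:

c = np.memmap('a.array', dtype='float64', mode='r', shape=(5000, 4000), order='F')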


Saullo G. P. Castro
5

Maybe an alternative solution: I also had a single multidimensional array spread over multiple files, which I only wanted to read. I solved this issue with dask concatenation.

import numpy as np
import dask.array as da

a = np.memmap('a.array', dtype='float64', mode='r', shape=(5000, 1000))
b = np.memmap('b.array', dtype='float64', mode='r', shape=(15000, 1000))

c = da.concatenate([a, b], axis=0)

This way one avoids the hacky additional file handle. The dask array can then be sliced and worked with almost like any numpy array, and when it comes time to calculate a result one calls .compute().
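For example (a hypothetical usage sketch), reductions over slices only pull the required blocks through the memmaps:

result = c[::2, :100].mean().compute()  # reads only the chunks it needs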

Note that there are two caveats:

  1. in-place re-assignment is not possible, e.g. c[::2] = 0 will fail, so creative solutions are necessary in those cases.
  2. this also means the original files can no longer be updated. To save results out, dask's store method should be used; it can again write into a memory-mapped array (see the sketch below).
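A minimal sketch of that last point, assuming the shapes from above and a hypothetical output file 'out.array':

out = np.memmap('out.array', dtype='float64', mode='w+', shape=c.shape)
da.store(c, out)  # streams the dask blocks into the memmapped target
out.flush()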
DIN14970
0

If you use order='F', it leads to another problem: when you load the file the next time, it will be quite a mess even if you pass order='F' again. So my solution is below; I have tested it a lot and it works fine.

import numpy as np

fp = ...                # your old memmap
shape = fp.shape
data = ...              # your ndarray to append along the last axis
data_shape = data.shape
# new shape: same leading dimensions, last-axis lengths added together
concat_shape = data_shape[:-1] + (data_shape[-1] + shape[-1],)
print('concat shape: {}'.format(concat_shape))
# mode='w+' creates the new file at the full concatenated size
# (use 'r+' instead if the file already exists)
new_fp = np.memmap(new_file_name, dtype='float32', mode='w+', shape=concat_shape)
if len(concat_shape) == 1:
    new_fp[:shape[0]] = fp[:]
    new_fp[shape[0]:] = data[:]
elif len(concat_shape) == 2:
    new_fp[:, :shape[-1]] = fp[:]
    new_fp[:, shape[-1]:] = data[:]
elif len(concat_shape) == 3:
    new_fp[:, :, :shape[-1]] = fp[:]
    new_fp[:, :, shape[-1]:] = data[:]
fp = new_fp
fp.flush()
Eric Zhang