Concatenate Numpy arrays with least memory

Question

I have a 50GB dataset saved with h5py; the file behaves like a dictionary. Its keys run from 0 to n, and the values are 3-dimensional numpy ndarrays that all have the same shape. For example:

dictionary[0] = np.array([[[...],[...]]...])

I want to concatenate all of these arrays, with code like:

sample = np.concatenate(list(dictionary.values))

This operation uses 100GB of memory! If I then run

del dictionary

usage drops to 50GB. But I want to keep memory usage at 50GB while the data is loading. Another approach I tried:

    sample = np.concatenate((sample, dictionary[key]))

This still uses 100GB of memory. I think that in all the cases above, the right-hand side creates a new memory block to hold the result, which is then assigned to the left-hand side, so memory doubles during the computation. So the third approach I tried was:

sample = np.empty(shape)
with h5py.File(...) as dictionary:
    for key in dictionary.keys():
        sample[key] = dictionary[key]

I think this code has an advantage: once the value dictionary[key] has been assigned to a row of sample, the memory for dictionary[key] should be freed. However, when I tested it, memory usage was still 100GB. Why?

Is there a good way to limit memory usage to 50GB?

Maybe related: https://stackoverflow.com/questions/7869095/concatenate-numpy-arrays-without-copying — dasWesen, Aug 27 '18 at 16:37

pask · Accepted Answer · 2018-07-18T10:40:51.937

Your problem is that you need to have 2 copies of the same data in memory. If you build the array as in test1 you'll need far less memory at once, but at the cost of losing the dictionary.

import numpy as np
import time    

def test1(n):
    a = {x:(x, x, x) for x in range(n)} # Build sample data
    b = np.array([a.pop(i) for i in range(n)]).reshape(-1)
    return b

def test2(n):
    a = {x:(x, x, x) for x in range(n)} # Build sample data
    b = np.concatenate(list(a.values()))
    return b

x1 = test1(1000000)
del x1

time.sleep(1)

x2 = test2(1000000)

Results:

test1 : 0.71 s
test2 : 1.39 s

The first peak in the memory-usage plot is for test1. It's not exactly in place, but it reduces the memory usage quite a bit.
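For anyone who wants to check the memory behaviour themselves, here is a rough sketch using the standard-library tracemalloc module. The function names and the helper peak_bytes are mine, not part of the answer above. As far as I know, recent NumPy versions report their array buffers to tracemalloc, but treat the numbers as indicative only.

```python
import tracemalloc

import numpy as np


def build_by_popping(n):
    # Same idea as test1: pop each value so the dict shrinks as the list grows.
    a = {x: (x, x, x) for x in range(n)}
    return np.array([a.pop(i) for i in range(n)]).reshape(-1)


def build_by_concatenate(n):
    # Same idea as test2: the dict stays alive while concatenate runs.
    a = {x: (x, x, x) for x in range(n)}
    return np.concatenate(list(a.values()))


def peak_bytes(func, *args):
    """Hypothetical helper: run func once and return (result, peak traced bytes)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


x1, peak1 = peak_bytes(build_by_popping, 100_000)
x2, peak2 = peak_bytes(build_by_concatenate, 100_000)
print(f"popping:     peak {peak1 / 1e6:.1f} MB")
print(f"concatenate: peak {peak2 / 1e6:.1f} MB")
```

tracemalloc only sees allocations made through Python's and NumPy's allocators, so an OS-level monitor may show different absolute figures.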

edited Jul 18 '18 at 10:40

answered Jul 18 '18 at 10:34

pask

How did you plot the RAM picture? Also, the variable a is not necessary in my case: I connect to the h5py file and take data from it. How can I pop the data without putting it into a dictionary first? – Wong Pi Jul 18 '18 at 11:16

Shouldn't it be "np.array([a.pop(0) for i in range(n)]).reshape(-1)" (0, not i)? Also, it didn't work for me, still getting a memory error when using this instead of "A = np.concatenate(mylist, axis=0)" – dasWesen Aug 27 '18 at 16:40

hpaulj · Answer 2 · 2018-07-18T16:32:49.717

dictionary[key] is a dataset on the file. dictionary[key][...] will be a numpy array: that dataset downloaded into memory.

I imagine

sample[key] = dictionary[key]

is evaluated as

sample[key,...] = dictionary[key][...]

The dataset is downloaded, and then copied to a slice of the sample array. That downloaded array should be free for recycling. But whether numpy/python does that is another matter. I'm not in the habit of pushing memory limits.
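If the temporary downloaded array is the concern, h5py datasets have a read_direct method that reads into an existing array. The following is only a sketch under my own assumptions: the file name demo.h5, the dataset names '0' to '3' and the small shapes are made up, and I have not measured it on anything close to 50GB.

```python
import os
import tempfile

import h5py
import numpy as np

n, shape = 4, (2, 3)  # made-up sizes for illustration

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")  # hypothetical file name

    # Build a small file with datasets named '0', '1', ...
    with h5py.File(path, "w") as f:
        for i in range(n):
            f.create_dataset(str(i), data=np.full(shape, float(i)))

    with h5py.File(path, "r") as f:
        first = f["0"]
        # Allocate the full result once, using the shape and dtype of one dataset.
        sample = np.empty((n,) + first.shape, dtype=first.dtype)
        for i in range(n):
            # Read dataset i straight into row i of sample.
            f[str(i)].read_direct(sample, dest_sel=np.s_[i])

print(sample.shape)
```

read_direct expects the destination array to be C-contiguous, which np.empty gives by default.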

You don't want to do the incremental concatenate, because that's slow. A single concatenate on the list should be faster. I don't know for sure what

list(dictionary.values)

contains. Will it be references to the datasets, or downloaded arrays? Regardless, concatenate(...) on that list will have to use the downloaded arrays.

One thing puzzles me - how can you use the same key to index the first dimension of sample and dataset in dictionary? h5py keys are supposed to be strings, not integers.
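A related point if the datasets are named '0', '1', ..., 'n': string names sort lexicographically, and I believe h5py iterates names in that order by default, so '10' may come before '2'. A small sketch, assuming such names, of recovering numeric order before using the key as a row index:

```python
# Hypothetical dataset names, as they might come back from list(f.keys()).
keys = [str(i) for i in range(12)]

lexicographic = sorted(keys)      # plain string order
numeric = sorted(keys, key=int)   # order by integer value of the name

print(lexicographic)
print(numeric)

# With numeric order, int(key) can safely be used as the row index:
rows = [int(k) for k in numeric]
```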

Some testing

Note that I'm using string dataset names:

In [21]: d = f.create_dataset('0',data=np.zeros((2,3)))
In [22]: d = f.create_dataset('1',data=np.zeros((2,3)))
In [23]: d = f.create_dataset('2',data=np.ones((2,3)))
In [24]: d = f.create_dataset('3',data=np.arange(6.).reshape(2,3))

Your np.concatenate(list(dictionary.values)) code is missing ():

In [25]: f.values
Out[25]: <bound method MappingHDF5.values of <HDF5 file "test.hf" (mode r+)>>
In [26]: f.values()
Out[26]: ValuesViewHDF5(<HDF5 file "test.hf" (mode r+)>)
In [27]: list(f.values())
Out[27]: 
[<HDF5 dataset "0": shape (2, 3), type "<f8">,
 <HDF5 dataset "1": shape (2, 3), type "<f8">,
 <HDF5 dataset "2": shape (2, 3), type "<f8">,
 <HDF5 dataset "3": shape (2, 3), type "<f8">]

So it's just a list of the datasets. The downloading occurs when concatenate does a np.asarray(a) for each element of the list:

In [28]: np.concatenate(list(f.values()))
Out[28]: 
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [1., 1., 1.],
       [1., 1., 1.],
       [0., 1., 2.],
       [3., 4., 5.]])

e.g.:

In [29]: [np.array(a) for a in f.values()]
Out[29]: 
[array([[0., 0., 0.],
        [0., 0., 0.]]), array([[0., 0., 0.],
        [0., 0., 0.]]), array([[1., 1., 1.],
        [1., 1., 1.]]), array([[0., 1., 2.],
        [3., 4., 5.]])]
In [30]: [a[...] for a in f.values()]
    ....
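The same mechanism can be imitated without a file. In this sketch LazyBlock is a made-up stand-in for a dataset: it produces its data only when numpy asks for an array through the __array__ hook. As far as I know, h5py datasets expose a similar hook, but that is an assumption on my part.

```python
import numpy as np


class LazyBlock:
    """Hypothetical stand-in for an on-disk dataset."""

    loads = []  # records which blocks have been converted to arrays

    def __init__(self, value, block_shape=(2, 3)):
        self.value = value
        self.block_shape = block_shape

    def __array__(self, dtype=None, copy=None):
        # Called by numpy when it needs the actual data.
        LazyBlock.loads.append(self.value)
        return np.full(self.block_shape, self.value, dtype=dtype or float)


blocks = [LazyBlock(v) for v in (0.0, 1.0, 2.0)]
loads_before = list(LazyBlock.loads)   # building the list loads nothing

result = np.concatenate(blocks)        # conversion happens in here
loads_after = list(LazyBlock.loads)

print(loads_before, loads_after, result.shape)
```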

Let's look at what happens when using your iteration approach:

Make an array that can take one dataset for each 'row':

In [34]: samples = np.zeros((4,2,3),float)
In [35]: for i,d in enumerate(f.values()):
    ...:     v = d[...]
    ...:     print(v.__array_interface__['data']) # databuffer location
    ...:     samples[i,...] = v
    ...:     
(27845184, False)
(27815504, False)
(27845184, False)
(27815504, False)
In [36]: samples
Out[36]: 
array([[[0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.]],

       [[1., 1., 1.],
        [1., 1., 1.]],

       [[0., 1., 2.],
        [3., 4., 5.]]])

In this small example, it recycled every other databuffer block. The 2nd iteration frees up the databuffer used in the first, which can then be reused in the 3rd, and so on.

These are small arrays in an interactive ipython session. I don't know if these observations apply in large cases.
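One way to probe somewhat larger cases is to repeat the experiment with bigger chunks and count the distinct buffer addresses. This is only a sketch: the generator chunks is a made-up stand-in for reading datasets from a file, the sizes are arbitrary, and whether addresses get reused depends on the allocator and platform.

```python
import numpy as np


def chunks(n, shape):
    """Hypothetical stand-in for reading n datasets from disk, one at a time."""
    for i in range(n):
        yield np.full(shape, float(i))


n, shape = 16, (128, 1024)          # 16 chunks of 1 MiB each (float64)
samples = np.empty((n,) + shape)    # allocate the full result once

addresses = []
for i, v in enumerate(chunks(n, shape)):
    addresses.append(v.__array_interface__['data'][0])  # databuffer location
    samples[i, ...] = v                                  # copy into row i
del v  # drop the last temporary chunk

distinct = len(set(addresses))
print(f"{n} chunks used {distinct} distinct buffer addresses")
```

While the next chunk is being created, the previous one is still bound to v, so two temporaries can be alive at once. That would be consistent with the alternating addresses seen above.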

list(dictionary.values) holds references to the datasets, and concatenate then allocates new RAM. The keys in my own program have been converted to int. The iteration approach works well. – Wong Pi, Jul 19 '18 at 09:53
