dictionary[key]
is a dataset on the file. dictionary[key][...]
will be an numpy array, that dataset downloaded.
I imagine
sample[key] = dictionary[key]
is evaluated as
sample[key,...] = dictionary[key][...]
The dataset is downloaded, and then copied to a slice of the sample
array. That downloaded array should be free for recycling. But whether numpy/python does that is another matter. I'm not in the habit of pushing memory limits.
You don't want to do the incremental concatenate - that's slow. A single concatenate on the list should be faster. I don't know for such what
list(dictionary.values)
contains. Will it be references to the datasets, or downloaded arrays? Regardless concatenate(...)
on that list will have to used the downloaded arrays.
One thing puzzles me - how can you use the same key
to index the first dimension of sample
and dataset in dictionary
? h5py
keys are supposed to be strings, not integers.
Some testing
Note that I'm using string dataset names:
In [21]: d = f.create_dataset('0',data=np.zeros((2,3)))
In [22]: d = f.create_dataset('1',data=np.zeros((2,3)))
In [23]: d = f.create_dataset('2',data=np.ones((2,3)))
In [24]: d = f.create_dataset('3',data=np.arange(6.).reshape(2,3))
Your np.concatenate(list(dictionary.values))
code is missing ()
:
In [25]: f.values
Out[25]: <bound method MappingHDF5.values of <HDF5 file "test.hf" (mode r+)>>
In [26]: f.values()
Out[26]: ValuesViewHDF5(<HDF5 file "test.hf" (mode r+)>)
In [27]: list(f.values())
Out[27]:
[<HDF5 dataset "0": shape (2, 3), type "<f8">,
<HDF5 dataset "1": shape (2, 3), type "<f8">,
<HDF5 dataset "2": shape (2, 3), type "<f8">,
<HDF5 dataset "3": shape (2, 3), type "<f8">]
So it's just a list of the datasets. The downloading occurs when concatenate
does a np.asarray(a)
for each element of the list:
In [28]: np.concatenate(list(f.values()))
Out[28]:
array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[1., 1., 1.],
[1., 1., 1.],
[0., 1., 2.],
[3., 4., 5.]])
e.g.:
In [29]: [np.array(a) for a in f.values()]
Out[29]:
[array([[0., 0., 0.],
[0., 0., 0.]]), array([[0., 0., 0.],
[0., 0., 0.]]), array([[1., 1., 1.],
[1., 1., 1.]]), array([[0., 1., 2.],
[3., 4., 5.]])]
In [30]: [a[...] for a in f.values()]
....
Let's look at what happens when using your iteration approach:
Make an array that can takes one dataset for each 'row':
In [34]: samples = np.zeros((4,2,3),float)
In [35]: for i,d in enumerate(f.values()):
...: v = d[...]
...: print(v.__array_interface__['data']) # databuffer location
...: samples[i,...] = v
...:
(27845184, False)
(27815504, False)
(27845184, False)
(27815504, False)
In [36]: samples
Out[36]:
array([[[0., 0., 0.],
[0., 0., 0.]],
[[0., 0., 0.],
[0., 0., 0.]],
[[1., 1., 1.],
[1., 1., 1.]],
[[0., 1., 2.],
[3., 4., 5.]]])
In this small example, it recycled every other databuffer block. The 2nd iteration frees up the databuffer used in the first, which can then be reused in the 3rd, and so on.
These are small arrays in a interactive ipython
session. I don't know if these observations apply in large cases.