Currently, I load HDF5 data in Python via h5py and read a dataset into memory:
import h5py

f = h5py.File('myfile.h5', 'r')
dset = f['mydataset'][:]
This works, but if 'mydataset' is the only dataset in myfile.h5, then the following is more efficient:
f = h5py.File('myfile.h5', 'r', driver='core')
dset = f['mydataset'][:]
I believe this is because the 'core' driver memory-maps the entire file, which is an optimised way of loading data into memory.
My question is: is it possible to use the 'core' driver on selected dataset(s) only? In other words, when loading the file I want to memory-map only selected datasets and/or groups. I have a file with many datasets and would like to load each one into memory sequentially (a sketch of the access pattern is below); I cannot load them all at once, since in aggregate they won't fit in memory.
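For concreteness, this is roughly the access pattern I have in mind; process is just a placeholder for my per-dataset work, and I am assuming here that all the datasets sit at the root of the file:

import h5py
import numpy as np

with h5py.File('myfile.h5', 'r') as f:
    for name, obj in f.items():
        if not isinstance(obj, h5py.Dataset):
            continue  # only read datasets, skip groups
        # read the whole dataset into a pre-allocated array, use it, then discard it
        buf = np.empty(obj.shape, dtype=obj.dtype)
        obj.read_direct(buf)
        process(buf)  # placeholder for whatever I do with each dataset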
I understand one alternative is to split my single HDF5 file with many datasets into many HDF5 files with one dataset each. However, I am hoping there might be a more elegant solution, possibly using the h5py low-level API (my attempt with it so far is sketched below).
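With the low-level API I can reproduce what driver='core' does, roughly as below, but I do not see how to restrict the in-memory image to a single dataset. This is only a sketch: the block_size and backing_store values are guesses on my part, and wrapping the low-level FileID in h5py.File is my understanding of how the two layers connect.

import h5py

# build a file-access property list that selects the core (in-memory) driver
fapl = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
fapl.set_fapl_core(block_size=64 * 1024, backing_store=False)

# open the file with that property list and wrap it in the high-level API
fid = h5py.h5f.open(b'myfile.h5', h5py.h5f.ACC_RDONLY, fapl=fapl)
f = h5py.File(fid)
dset = f['mydataset'][:]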
Update: Even if what I am asking is not possible, can someone explain why using driver='core' gives substantially better performance when reading in a whole dataset? Is reading the only dataset of an HDF5 file into memory very different from memory-mapping it via the core driver?
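For what it's worth, this is how I compare the two cases; it is just a minimal timing harness, and the actual numbers of course depend on the file and the machine:

import time
import h5py

def load(driver=None):
    t0 = time.perf_counter()
    with h5py.File('myfile.h5', 'r', driver=driver) as f:
        dset = f['mydataset'][:]
    return time.perf_counter() - t0

print('default driver:', load())
print('core driver:   ', load('core'))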