
Currently, I load HDF5 data in Python via h5py and read a dataset into memory:

import h5py

f = h5py.File('myfile.h5', 'r')
dset = f['mydataset'][:]

This works, but if 'mydataset' is the only dataset in myfile.h5, then the following is more efficient:

f = h5py.File('myfile.h5', 'r', driver='core')
dset = f['mydataset'][:]

I believe this is because the 'core' driver memory maps the entire file, which is an optimised way of loading data into memory.

My question is: is it possible to use the 'core' driver on selected dataset(s)? In other words, on loading the file I only wish to memory map selected datasets and/or groups. I have a file with many datasets and I would like to load each one into memory sequentially. I cannot load them all at once, since in aggregate they won't fit in memory.
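Concretely, the sequential processing I have in mind looks roughly like this (process is just a placeholder for whatever I do with each dataset, and it assumes every top-level member is a dataset):

import h5py

with h5py.File('myfile.h5', 'r') as f:
    for name in f:            # iterate over the top-level members
        data = f[name][:]     # read one dataset fully into memory
        process(data)         # placeholder for the per-dataset work
        del data              # free the array before loading the next one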

I understand one alternative is to split my single HDF5 file with many datasets into many HDF5 files with one dataset each. However, I am hoping there might be a more elegant solution, possibly using the h5py low-level API.

Update: Even if what I am asking is not possible, can someone explain why using driver='core' has substantially better performance when reading in a whole dataset? Is reading the only dataset of an HDF5 file into memory very different from memory mapping it via core driver?

jpp
    Please add some things: 1) which device are you reading from (SSD, HDD, NAS)? 2) chunk size 3) compression algorithm 4) operating system 5) most important: what speed do you obtain in both cases? Such performance problems can really depend on many things... – max9111 Jan 25 '18 at 14:35

1 Answer


I guess it is the same problem as when you read a file by looping over an arbitrary axis without setting a proper chunk-cache-size.

If you read it with the core driver, the whole file is guaranteed to be read sequentially from disk, and everything else (decompression, converting chunked data to a contiguous layout, ...) is done entirely in RAM.

I used the simplest fancy-slicing example from https://stackoverflow.com/a/48405220/4045774 to write the test data.

import h5py as h5
import time
import h5py_cache as h5c

def reading():
    file_name_hdf5 = 'Test.h5'

    # Core driver: the whole file is loaded into memory up front
    t1 = time.time()
    f = h5.File(file_name_hdf5, 'r', driver='core')
    dset = f['Test'][:]
    f.close()
    print(time.time() - t1)

    # Default driver with a 500 MB chunk cache (via h5py_cache)
    t1 = time.time()
    f = h5c.File(file_name_hdf5, 'r', chunk_cache_mem_size=1024**2 * 500)
    dset = f['Test'][:]
    f.close()
    print(time.time() - t1)

    # Default driver with the default 1 MB chunk cache
    t1 = time.time()
    f = h5.File(file_name_hdf5, 'r')
    dset = f['Test'][:]
    f.close()
    print(time.time() - t1)

if __name__ == "__main__":
    reading()

On my machine this gives 2.38 s (core driver), 2.29 s (with a 500 MB chunk-cache-size) and 4.29 s (with the default chunk-cache-size of 1 MB).

max9111
  • Is it possible to set the chunk-cache-size with the low-level h5py interface rather than the h5py_cache library? Unfortunately, we don't have access to h5py_cache. – jpp Jan 25 '18 at 15:26
  • Yes, it is. Take a look at the __init__.py in the h5py-cache package: it is just a simple pure-Python wrapper around h5py low-level functions (see the sketch below). The aim of the package is to simplify things for the user, and it would be great if it were implemented in the official h5py package in the future. It's only about 75 lines of code with lots of comments... – max9111 Jan 25 '18 at 15:35
  • I'll have a look. So far I've avoided it since I can't access `h5py_cache` but this could yield massive improvements. I agree it should be implemented in h5py. – jpp Jan 25 '18 at 15:38
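
For reference, a minimal sketch of what such a wrapper can look like using only h5py's low-level API (the helper name open_with_chunk_cache and the 500 MB cache size are just illustrative choices, not part of h5py):

import h5py as h5

def open_with_chunk_cache(file_name, cache_bytes=500 * 1024**2):
    # Build a file-access property list and enlarge the raw-data chunk cache
    propfaid = h5.h5p.create(h5.h5p.FILE_ACCESS)
    settings = list(propfaid.get_cache())
    settings[2] = cache_bytes  # rdcc_nbytes: chunk cache size in bytes
    propfaid.set_cache(*settings)
    # Open the file with that property list and wrap it in a high-level File object
    fid = h5.h5f.open(file_name.encode('utf-8'), flags=h5.h5f.ACC_RDONLY, fapl=propfaid)
    return h5.File(fid)

f = open_with_chunk_cache('Test.h5')
dset = f['Test'][:]
f.close()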