I have a huge HDF5 file (~100 GB, contiguous storage) from which I need random access to different points. Indexing in Python/h5py, or in C with H5Dread, seems very slow, so I want to mmap the data directly.

This actually works with h5py/numpy on my local 64-bit Fedora 25 machine, following this. But on a remote cluster, numpy's mmap fails for large files (`[Errno 12] Cannot allocate memory`), even though Python there appears to be 64-bit and a simple C test that mmaps a 100 GB file works. So something may be wrong with the cluster's Python.
One alternative is to use mmap directly from C. I wrote a small test that creates an HDF5 file with a 1-D dataset and retrieves the dataset's byte offset with `H5Dget_offset`. However, the results are not correct. Here is the core code:
```c
/* Get dataset offset within file */
file_id = H5Fopen(FILE, H5F_ACC_RDONLY, H5P_DEFAULT);
dataset_id = H5Dopen2(file_id, "/dset", H5P_DEFAULT);
offset = H5Dget_offset(dataset_id);

fd = open(FILE, O_RDONLY);
/* align with page size */
pa_offset = offset & ~(sysconf(_SC_PAGE_SIZE) - 1);
length = NX * NY * sizeof(int);
addr = mmap(NULL, length + offset - pa_offset, PROT_READ,
            MAP_PRIVATE, fd, pa_offset);
```
Discussion under this blog post mentions that Julia achieves this through `H5Fget_vfd_handle` and `H5Dget_offset`, but I haven't found a detailed/easy explanation.
- The offset I get from Python/h5py's `dataset.id.get_offset` is identical to the one I get from `H5Dget_offset` in C.
- My core question: how do I use the offset returned by C's `H5Dget_offset` to mmap the dataset?
- Should `mmap` be much faster than naive HDF5 access in the first place?