
My C++/C program reads hundreds of compressed compound arrays sequentially from an HDF5 file and stores them in vectors. I would like to improve its time performance. I wish I could read 3 or 4 of them in parallel, then the next 3 or 4, and so on. I am totally new to multithreading, OpenMP, or any parallel programming. My questions are:

- Is it possible to implement what I want with HDF5/C/C++ on Linux?
- If so, can you direct me to some info or a tutorial for beginners?

Thank you. With respect, Nyama

user3517697

1 Answer

HDF5 technically has a thread-safe mode, but it serializes all library calls, so there is no performance benefit (see the HDF5 documentation on thread safety). Depending on your application, you can use fork to create parallel processes instead of parallel threads. If you take this approach, you may need interprocess communication (IPC) to transfer the data back to the main process.
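
As a rough sketch of the fork approach (not a drop-in solution): each child process opens the file on its own, reads one dataset, and reports back through a pipe. The file name data.h5 and the dataset names /array0 … /array3 are placeholders for illustration; a real program would send the decoded data itself (or put it in shared memory) rather than just the element count.

```c
#include <hdf5.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical dataset names -- replace with your own. */
    const char *dsets[] = { "/array0", "/array1", "/array2", "/array3" };
    const int n = 4;
    int pipes[4][2];
    pid_t pids[4];

    for (int i = 0; i < n; ++i) {
        pipe(pipes[i]);
        pids[i] = fork();
        if (pids[i] == 0) {                 /* child process */
            close(pipes[i][0]);
            /* Each child opens the file independently, so no HDF5
               library state is shared between processes. */
            hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
            hid_t dset = H5Dopen2(file, dsets[i], H5P_DEFAULT);
            hid_t space = H5Dget_space(dset);
            hssize_t npoints = H5Sget_simple_extent_npoints(space);
            /* A real child would H5Dread the data and send it back;
               here we only send the element count. */
            write(pipes[i][1], &npoints, sizeof npoints);
            H5Sclose(space);
            H5Dclose(dset);
            H5Fclose(file);
            _exit(0);
        }
        close(pipes[i][1]);                 /* parent keeps the read end */
    }

    for (int i = 0; i < n; ++i) {
        hssize_t npoints = 0;
        read(pipes[i][0], &npoints, sizeof npoints);
        close(pipes[i][0]);
        waitpid(pids[i], NULL, 0);
        printf("%s: %lld elements\n", dsets[i], (long long)npoints);
    }
    return 0;
}
```

Compile with h5cc (or link with -lhdf5). Because each child opens the file independently after the fork, no HDF5 library state is shared across processes, which is what makes this safe without the thread-safe build.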

Note that whether any of these parallel reading approaches gives any benefit depends a lot on how the HDF5 files are stored on disk. If they're sitting on a standard 7200 RPM disk, you'll probably make things much slower by trying to do parallel reads because you'll start seeking all over the file instead of nicely streaming out contiguous chunks (assuming your disk is not already very fragmented). On the other hand, if the data are on a more advanced file server, on an SSD with a good controller, or on a RAID array, there's a better chance you'll see a benefit. I suggest first doing some profiling to see if the time is being spent doing real filesystem I/O (in which case you need better disk or to spread your data across multiple disks), decompression (multithreading or multiprocessing is more likely to be a big help if this is the bottleneck), or other operations.
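
One quick way to do that profiling is to compare a raw sequential read of the whole file (the moral equivalent of cat file > /dev/null) against a full H5Dread of one dataset: if both take about the same time you are I/O-bound, while a much slower H5Dread points at decompression or HDF5 overhead. A minimal sketch, with data.h5 and /array0 as placeholder names:

```c
#include <hdf5.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock seconds. */
static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    const char *path = "data.h5";       /* hypothetical file name  */
    const char *dset_name = "/array0";  /* hypothetical dataset    */

    /* 1. Raw sequential read of the whole file. */
    double t0 = now();
    FILE *f = fopen(path, "rb");
    if (!f) { perror(path); return 1; }
    char buf[1 << 20];
    size_t total = 0, got;
    while ((got = fread(buf, 1, sizeof buf, f)) > 0)
        total += got;
    fclose(f);
    printf("raw read : %zu bytes in %.2f s\n", total, now() - t0);

    /* 2. Full HDF5 read of one dataset (includes decompression). */
    t0 = now();
    hid_t file = H5Fopen(path, H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, dset_name, H5P_DEFAULT);
    hid_t dtype = H5Dget_type(dset);
    hid_t space = H5Dget_space(dset);
    hssize_t npoints = H5Sget_simple_extent_npoints(space);
    void *data = malloc((size_t)npoints * H5Tget_size(dtype));
    H5Dread(dset, dtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
    printf("H5Dread  : %lld elements in %.2f s\n",
           (long long)npoints, now() - t0);

    free(data);
    H5Sclose(space);
    H5Tclose(dtype);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}
```

Run each test on a cold cache (e.g. after echo 3 > /proc/sys/vm/drop_caches as root, or on a freshly copied file), otherwise the second read will be served from the page cache and look artificially fast.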

Mr Fooz
  • Ok. The program is spending its time on HDF5 dataset reading. For example, reading the dataset 17gsA: my program with the uncompressed HDF5 database takes 48 sec (size of 17gsA = 50 MB); with the compressed HDF5 database, 29 sec (size = 13 MB); and the original program with the compressed text database, 25 sec (size = 10 MB). HDF5 was not compressing the variable-length compound array well, so I made my dataset a fixed compound type, which is why the sizes differ. The time is clearly proportional to the sizes, so I think most of the time is spent loading data from the hard disk to RAM. – user3517697 Apr 10 '14 at 12:15
  • 50 MiB in 48 seconds (about 1 MiB/s) is a pretty low raw data rate. Are you using a networked filesystem or a local disk? A defragmented 7200 RPM local disk should be able to read sequentially at about 100 MiB/s. How long does "time cat $your_file" take on the command line (from the same machine where you normally run your HDF5 job)? – Mr Fooz Apr 10 '14 at 16:17