2

I have several big HDF5 files stored on an SSD (the LZF-compressed file size is 10–15 GB, the uncompressed size would be 20–25 GB). Reading the contents of such a file into RAM for further processing takes roughly 2 minutes per file. During that time only one core is utilized (but to 100%), so I guess the decompression running on the CPU is the bottleneck rather than the IO throughput of the SSD.

At the start of my program it reads multiple files of that kind into RAM, which takes quite some time. I would like to speed up that process by utilizing more cores and possibly more RAM, until the SSD's IO throughput becomes the limiting factor. The machine I'm working on has plenty of resources (20 CPU cores [+ 20 HT] and 400 GB RAM), and »wasting« RAM is no big deal as long as it is justified by saving time.

I had two ideas on my own:

1) Use Python's multiprocessing module to read several files into RAM in parallel. This works in principle, but due to the use of pickle within multiprocessing (as stated here), I hit the 4 GB serialization limit:

OverflowError('cannot serialize a bytes object larger than 4 GiB').
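For context, a minimal sketch of what approach 1 looks like (file names and the dataset label are placeholders); each worker returns one fully decompressed array, and pickling that result back to the parent process is what hits the 4 GiB limit:

# Sketch of approach 1 (placeholder names): one worker process per file.
# Each worker returns the whole decompressed array, so multiprocessing has
# to pickle it back to the parent -- this is where the 4 GiB limit bites.
import h5py
from multiprocessing import Pool

def _load_whole_file(args):
    filename, label = args
    with h5py.File(filename, 'r') as h_file:
        return h_file[label][:]

def load_files_in_parallel(filenames, label):
    with Pool(processes=len(filenames)) as pool:
        return pool.map(_load_whole_file, [(f, label) for f in filenames])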

2) Make several processes (using a Pool from the multiprocessing module) open the same HDF5 file (using with h5py.File('foo.h5', 'r') as h_file:), read an individual chunk from it (chunk = h_file['label'][i : i + chunk_size]) and return that chunk. The gathered chunks are then concatenated. However, this fails with an

OSError: Can't read data (data error detected by Fletcher32 checksum).

Is this due to the fact that I open the very same file from within multiple processes (as suggested here)?


So my final question is: How can I read the contents of the .h5 files into main memory faster? Again: »wasting« RAM in favor of saving time is permitted. The contents have to reside in main memory, so circumventing the problem by just reading lines, or fractions, is not an option. I know that I could simply store the .h5 files uncompressed, but that is the last option I would like to use, since space on the SSD is scarce. I would prefer having both compressed files and fast reads (ideally by better utilizing the available resources).

Meta information: I use Python 3.5.2 and h5py 2.8.0.


EDIT: While reading the file, the SSD works at a speed of 72 MB/s, far from its maximum. The .h5 files were created using h5py's create_dataset method with the compression="lzf" option.
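For reference, the files were written roughly like this (the array and dataset names are placeholders):

# Roughly how the compressed files were written (array/dataset names are placeholders).
import h5py
import numpy as np

data = np.random.rand(10000, 1000)  # stand-in for the real data
with h5py.File('foo.h5', 'w') as h_file:
    h_file.create_dataset('label', data=data, compression='lzf')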

EDIT 2: This is (simplified) the code I use to read the content of a (compressed) HDF5 file:

import h5py
import numpy as np
from itertools import repeat
from multiprocessing import Pool

def opener(filename, label): # regular version
    with h5py.File(filename, 'r') as h_file:
        data = h_file[label][:]  # reads and decompresses the whole dataset
    return data

def fast_opener(filename, label): # multiple processes version
    with h5py.File(filename, 'r') as h_file:
        length = len(h_file[label])
    pool = Pool() # multiprocessing.Pool and not multiprocessing.dummy.Pool
    args_iter = zip(
        range(0, length, 1000),
        repeat(filename),
        repeat(label),
    )
    chunks = pool.starmap(_read_chunk_at, args_iter)
    pool.close()
    pool.join()
    return np.concatenate(chunks)

def _read_chunk_at(index, filename, label):
    with h5py.File(filename, 'r') as h_file:
        data = h_file[label][index : index + 1000]
    return data

As you can see, the decompression is done by h5py transparently.

user3389669
  • Have a look at `iotop` if you are on Linux (assumed). Make sure your disk IO is not the bottleneck; otherwise, no matter how many processes you create, loading will not speed up. – knh190 Mar 22 '19 at 10:02
  • How is the decompression done? It seems to me the problem is that the LZF decompression runs single-core, independent of the fact that you're using HDF5 files. – GPhilo Mar 22 '19 at 10:11
  • Since this is an IO-bound workload and the reading and (de)compression presumably happens in C code, use `threading.Thread`s (or `multiprocessing.dummy.Pool` if you like the `multiprocessing` API). It shouldn't be bound by the GIL, as Python threads usually are. – AKX Mar 22 '19 at 10:30
  • @user3389669 can you add the code that loads the file in memory? Are you explicitly decompressing or is it handled by h5py? – GPhilo Mar 22 '19 at 10:43
  • I think method 2 should work. Is it possible that you had the file open in the parent process when creating the workers? I [discovered](https://github.com/h5py/h5py/issues/934#issuecomment-497674285) that the workers can end up sharing the file handle, leading to strange errors as they all try to seek and read with the same file descriptor. – Thomas K Jun 13 '19 at 09:50
  • @ThomasK As far as I remember, the file was not open in the parent process when the workers tried to open it (however, I can't exactly reconstruct it anymore). As can be seen in the listing above, the workers are outside of the file opening context manager – so I assume the file is closed before the workers start. – user3389669 Jun 14 '19 at 07:35
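For completeness, here is a sketch of the thread-based variant AKX suggests above (multiprocessing.dummy.Pool is a thread pool, so no pickling is involved and memory is shared); whether it actually scales depends on how much of the h5py/LZF work releases the GIL:

# Thread-based variant of fast_opener (no pickling, shared memory for free).
# Whether it scales depends on how much of the h5py/LZF work releases the GIL.
import h5py
import numpy as np
from itertools import repeat
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API

def threaded_opener(filename, label, chunk_size=1000):
    with h5py.File(filename, 'r') as h_file:
        length = len(h_file[label])
    pool = Pool()
    args_iter = zip(
        range(0, length, chunk_size),
        repeat(filename),
        repeat(label),
        repeat(chunk_size),
    )
    chunks = pool.starmap(_read_threaded_chunk_at, args_iter)
    pool.close()
    pool.join()
    return np.concatenate(chunks)

def _read_threaded_chunk_at(index, filename, label, chunk_size):
    with h5py.File(filename, 'r') as h_file:
        return h_file[label][index : index + chunk_size]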

1 Answer

3

h5py handles decompression of LZF files via a filter. The source code of the filter, implemented in C, is available in the h5py GitHub repository here. Looking at the implementation of lzf_decompress, which is the function causing your bottleneck, you can see it's not parallelized (no idea if it's even parallelizable; I'll leave that judgement to people more familiar with LZF's inner workings).

With that said, I'm afraid there's no way to just take your huge compressed file and multithread-decompress it. Your options, as far as I can tell, are:

  • Split the huge file into smaller, individually compressed chunks, parallel-decompress each chunk on a separate core (multiprocessing might help there, but you'll need to take care with inter-process shared memory) and join everything back together after it's decompressed; see the sketch after this list.
  • Just use uncompressed files.
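A hypothetical sketch of the first option, assuming the data can be rewritten once: split each big dataset into many small, individually LZF-compressed files, then let a process pool decompress them in parallel (each chunk stays well below the 4 GiB pickle limit). The file naming scheme and chunk size are made up:

# Hypothetical sketch of the "split into smaller chunks" option. File naming
# scheme and chunk size are made up; adjust to taste.
import h5py
import numpy as np
from multiprocessing import Pool

def split_into_chunk_files(src_filename, label, chunk_rows=100000):
    """One-off step: rewrite one big dataset as several small compressed files."""
    with h5py.File(src_filename, 'r') as src:
        dataset = src[label]
        chunk_files = []
        for i in range(0, len(dataset), chunk_rows):
            chunk_file = '{}.part{:04d}.h5'.format(src_filename, i // chunk_rows)
            with h5py.File(chunk_file, 'w') as dst:
                dst.create_dataset(label, data=dataset[i : i + chunk_rows],
                                   compression='lzf')
            chunk_files.append(chunk_file)
    return chunk_files

def _load_chunk_file(args):
    chunk_file, label = args
    with h5py.File(chunk_file, 'r') as h_file:
        return h_file[label][:]  # decompression happens inside this worker

def load_chunk_files(chunk_files, label):
    with Pool() as pool:
        chunks = pool.map(_load_chunk_file, [(f, label) for f in chunk_files])
    return np.concatenate(chunks)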
GPhilo