HDF5 possible data corruption or loss?

Question

On Wikipedia one can read the following criticism of HDF5:

Criticism of HDF5 follows from its monolithic design and lengthy specification. Though a 150-page open standard, there is only a single C implementation of HDF5, meaning all bindings share its bugs and performance issues. Compounded with the lack of journaling, documented bugs in the current stable release are capable of corrupting entire HDF5 databases. Although 1.10-alpha adds journaling, it is backwards-incompatible with previous versions. HDF5 also does not support UTF-8 well, necessitating ASCII in most places. Furthermore even in the latest draft, array data can never be deleted.

I am wondering whether this applies only to the C implementation of HDF5 or whether it is a general flaw of HDF5.

I am doing scientific experiments which sometimes generate Gigabytes of data and in all cases at least several hundred Megabytes of data. Obviously data loss and especially corruption would be a huge disadvantage for me.

My scripts always have a Python API, hence I am using h5py (version 2.5.0).

So, is this criticism relevant to me and should I be concerned about corrupted data?

First, all other implementations rely on the C library, so these issues apply everywhere. For data loss, I think the critical point is adding data to an already existing file. If you write just one file at a time, you can check that the write succeeded before deleting the source data, and it shouldn't be a problem. But I'm not an expert and would also like to see other opinions. — kakk11, Mar 10 '16 at 08:05
The critique comes from a discussion on [hackernews](https://news.ycombinator.com/item?id=10858189), where [skynetv2](https://news.ycombinator.com/item?id=10860496) points out that _"A crash may result in corruption but there is no high risk of corruption"_. — schoetbi, May 24 '17 at 08:32

score 6 · Answer 1 · edited Jun 20 '20 at 09:12

Declaration up front: I help maintain h5py, so I probably have a bias etc.

The Wikipedia page has changed since the question was posted; here's what I see:

Criticism

Criticism of HDF5 follows from its monolithic design and lengthy specification.

Though a 150-page open standard, the only other C implementation of HDF5 is just a HDF5 reader.

HDF5 does not enforce the use of UTF-8, so client applications may be expecting ASCII in most places.

Dataset data cannot be freed in a file without generating a file copy using an external tool (h5repack).

I'd say that pretty much sums up the problems with HDF5: it's complex (but people need this complexity, see the virtual dataset support), it has a long history with backwards compatibility as its focus, and it's not really designed to allow for massive changes in files. It's also not the best on Windows (due to how it deals with filenames).

I picked HDF5 for my research because, of the available options, it had decent metadata support (HDF5 at least allows UTF-8; formats like FITS don't even have that), support for multidimensional arrays (which formats like Protocol Buffers don't really support), and support for more than just 64-bit floats (which is very rare).

I can't comment on known bugs, but I have seen corruption (it happened when I was writing to a file and the Linux OOM killer killed my script). However, this shouldn't be a concern as long as you have proper data hygiene practices (as mentioned in the hackernews link), which in your case means not continuously writing to the same file but creating a new file for each run. You should also not modify the file; instead, any data reduction should produce new files, and you should always back up the originals.
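
To make the "new file per run" advice concrete, here is a minimal sketch using only the standard library. The names `write_run_file` and `write_fn` are made up for illustration; `write_fn(path)` stands for whatever creates and fills the file (with h5py that would typically be opening `h5py.File(path, "w")` and creating datasets in it). The data is written under a temporary name and only renamed to its final name once writing has finished, so a script that dies halfway leaves behind a clearly marked `.tmp_` file rather than something that looks like a finished run.

```python
import os
import tempfile
import time
import uuid


def write_run_file(directory, write_fn, prefix="run", suffix=".h5"):
    """Write one run's data to a brand-new file and return its final path.

    write_fn(path) is a hypothetical callback that creates and fills the file.
    """
    os.makedirs(directory, exist_ok=True)

    # Unique final name per run, so an existing file is never reopened.
    stamp = time.strftime("%Y%m%dT%H%M%S")
    name = f"{prefix}_{stamp}_{uuid.uuid4().hex[:8]}{suffix}"
    final_path = os.path.join(directory, name)

    # Temporary file in the same directory, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp_", suffix=suffix)
    os.close(fd)
    try:
        write_fn(tmp_path)
        # Only a completely written file ever gets the final name.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial file on ordinary failures, then re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

    # Mark the result read-only to discourage modifying it in place.
    os.chmod(final_path, 0o444)
    return final_path
```

If the process is killed outright (as with the OOM killer), the cleanup branch never runs, but the leftover file still carries the `.tmp_` prefix and can be discarded. Data reduction scripts would then read these files and write their results to new files with the same helper.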

Finally, it is worth pointing out that there are alternatives to HDF5, depending on what exactly your requirements are: SQL databases may fit your needs better (and sqlite comes with Python by default, so it's easy to experiment with), as could a simple csv file. I would recommend against custom/non-portable formats (e.g. pickle and similar), as they are no more robust than HDF5 and more complex than a csv file.
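
If you want to experiment with the sqlite route, a minimal sketch with the standard-library `sqlite3` module could look like the following. The table layout and the function names `store_samples` and `load_samples` are my own assumptions for illustration, not a recommended schema. All rows of a run are inserted inside one transaction, so a failed insert is rolled back rather than leaving half a run in the database.

```python
import sqlite3


def store_samples(db_path, run_id, samples):
    """Insert all samples of one run in a single transaction."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "CREATE TABLE IF NOT EXISTS samples ("
                "run_id TEXT NOT NULL, idx INTEGER NOT NULL, value REAL NOT NULL, "
                "PRIMARY KEY (run_id, idx))"
            )
            conn.executemany(
                "INSERT INTO samples (run_id, idx, value) VALUES (?, ?, ?)",
                [(run_id, i, float(v)) for i, v in enumerate(samples)],
            )
    finally:
        conn.close()


def load_samples(db_path, run_id):
    """Return the values of one run in their original order."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT value FROM samples WHERE run_id = ? ORDER BY idx",
            (run_id,),
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

One row per sample is fine for trying things out, but for Gigabytes of array data it is likely to be much slower and larger than HDF5, so treat this as a way to evaluate the approach rather than a drop-in replacement.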

I like compression and HDF5's typing, though. Would it be possible to add journaling? — Carbon, Feb 06 '20 at 01:55
I have seen discussions about adding journaling to HDF5, though I haven't been following them. Asking on https://forum.hdfgroup.org/ would probably be your best bet. — James Tocknell, Feb 07 '20 at 00:45
