
Consider the following code, which simply dumps one million 2-byte integers into an HDF5 file using HDFql:

#include <cstdio>
#include <cstdint>
#include <string>
#include <vector>
#include "HDFql.hpp"

// buffer for HDFql script strings
char script_[1024];

std::string filepath = "/tmp/test.h5";
sprintf(script_, "CREATE TRUNCATE FILE %s", filepath.c_str());
HDFql::execute(script_);
sprintf(script_, "USE FILE %s", filepath.c_str());
HDFql::execute(script_);

HDFql::execute("CREATE CHUNKED DATASET data AS SMALLINT(UNLIMITED)");

const int data_size = 1000000;
std::vector<uint16_t> data(data_size);
HDFql::variableRegister(&data[0]);

for(int i = 0; i < data_size; i++) { data.at(i) = i; }

// grow the dataset and write the registered buffer
sprintf(script_, "ALTER DIMENSION data TO +%d", data_size - 1);
HDFql::execute(script_);

sprintf(script_, "INSERT INTO data(-%d:1:1:%d) VALUES FROM MEMORY 0", 0, data_size);
HDFql::execute(script_);

Since HDF5 is an efficient binary format for storing data, I'd expect this file to be around 1E6 * 2 bytes ≈ 2 MB. Instead, the file is ~40 MB, around 20 times larger than expected! I first noticed this after using HDFql to convert another binary format to HDF5: the resulting HDF5 files were far bigger than the original binaries. Does anyone know what's going on here?

Many thanks!

Mr Squid
  • This is not an issue peculiar to HDFql, but rather to the HDF5 data format itself. Since you are not explicitly defining a chunk size and the dataset's dimension is unlimited, HDFql automatically sets the chunk size to 1, which, according to the information posted in http://hdf-forum.184993.n3.nabble.com/Questions-about-size-of-generated-Hdf5-files-td4029689.html, generates a lot of metadata that is stored in the HDF5 file (thus bloating its size). – SOG Apr 30 '20 at 10:34
  • So the size could be reduced by setting an upper bound on the dataset's dimension and defining an explicit chunk size? – Mr Squid Apr 30 '20 at 11:58
  • Yes, that could be. – SOG May 03 '20 at 12:30
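
Following up on the comments above, here is a minimal sketch of how an explicit chunk size might be set when creating the dataset, assuming HDFql's CHUNKED(<size>) syntax accepts explicit chunk dimensions as described in its reference manual. The chunk size of 100000 elements and the file name /tmp/test_chunked.h5 are illustrative choices, not values from the original post:

#include <cstdio>
#include <cstdint>
#include <vector>
#include "HDFql.hpp"

int main()
{
    char script[1024];
    const int data_size = 1000000;

    HDFql::execute("CREATE TRUNCATE FILE /tmp/test_chunked.h5");
    HDFql::execute("USE FILE /tmp/test_chunked.h5");

    // Explicit chunk of 100000 elements (~200 KB of SMALLINT data per chunk),
    // instead of letting the chunk size default to 1 element.
    HDFql::execute("CREATE CHUNKED(100000) DATASET data AS SMALLINT(UNLIMITED)");

    std::vector<int16_t> data(data_size);
    for (int i = 0; i < data_size; i++) { data[i] = static_cast<int16_t>(i); }
    HDFql::variableRegister(&data[0]);

    // Grow the dataset to hold all elements, then write the registered buffer.
    sprintf(script, "ALTER DIMENSION data TO +%d", data_size - 1);
    HDFql::execute(script);
    HDFql::execute("INSERT INTO data VALUES FROM MEMORY 0");

    HDFql::variableUnregister(&data[0]);
    return 0;
}

With chunks that cover many elements, the per-chunk metadata overhead described in the linked forum thread should mostly disappear, and the resulting file size should be much closer to the ~2 MB of raw data.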

0 Answers