Pytables: Can Appended Earray be reduced in size?

Question

Following the suggestions in an SO post, I found that appending with PyTables is exceptionally time efficient. However, in my case the output file (earray.h5) is huge. Is there a way to append the data so that the output file is smaller? For example, in my case (see link below) a 13 GB input file (dset_1: 2.1E8 x 4 and dset_2: 2.1E8 x 4) gives a 197 GB output file with just one column (2.5E10 x 1). All elements are float64.

I want to reduce the output file size without compromising the execution speed of the script, and the output file should also be efficient to read later. Can saving the data along columns, and not just rows, help? Any suggestions? An MWE is given below.

Output and input files' details here

import h5py
import tables

# no. of chunks from dset-1 and dset-2 in inp.h5
loop_1 = 40
loop_2 = 20 

# save to disk after these many rows
app_len = 10**6 

# **********************************************
#       Grabbing input.h5 file
# **********************************************
filename = 'inp.h5'
f2 = h5py.File(filename, 'r')
chunks1 = f2['dset_1']
chunks2 = f2['dset_2']
shape1, shape2 = chunks1.shape[0], chunks2.shape[0]

f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))

size1 = shape1//loop_1
size2 = shape2//loop_2

# ***************************************************
#       Grabbing chunks to process and append data
# ***************************************************
for c in range(loop_1):
    h = c*size1
    # grab chunks from dset_1 of inp.h5  
    chunk1 = chunks1[h:(h + size1)]

    for d in range(loop_2):
        g = d*size2
        chunk2 = chunks2[g:(g + size2)] # grab chunks from dset_2 of inp.h5 
        r1 = chunk1.shape[0]
        r2 = chunk2.shape[0]
        left, right = 0, 0

        for j in range(r1):  # grab col.2 values from dataset-1
            e1 = chunk1[j, 1]
            #...Algebraic operations here to output a row containing 4 float64
            #...append to a (earray) when no. of rows reach a million
        del chunk2
    del chunk1
f2.close()
f1.close()  # also close the output file so buffered data is flushed to disk
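
The "append when the number of rows reaches a million" step is elided in the MWE. One way it could look is sketched below. This is only an illustration under assumptions: `compute_row` is a hypothetical stand-in for the algebraic operations, the file name `buffered_demo.h5` is made up, and the flush threshold is shrunk from 10**6 to 1000 so the example runs quickly.

```python
import os
import tempfile

import numpy as np
import tables

APP_LEN = 1000  # flush threshold (the MWE uses app_len = 10**6)


def compute_row(value):
    # Hypothetical placeholder for the elided algebraic operations:
    # it just returns 4 float64 values derived from the input.
    return (value, 2.0 * value, value ** 2, value + 1.0)


out_path = os.path.join(tempfile.mkdtemp(), "buffered_demo.h5")
values = np.arange(2500, dtype=np.float64)  # stand-in for chunk1[:, 1]

with tables.open_file(out_path, "w") as h5:
    earray = h5.create_earray(
        h5.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4)
    )
    buffer = np.empty((APP_LEN, 4), dtype=np.float64)
    filled = 0
    for value in values:
        buffer[filled] = compute_row(value)
        filled += 1
        if filled == APP_LEN:
            # Buffer is full: write it to disk and start refilling it.
            earray.append(buffer)
            filled = 0
    if filled:
        # Write the rows left over after the last full buffer.
        earray.append(buffer[:filled])
    total_rows = earray.nrows
    last_row = earray[total_rows - 1].tolist()

print(total_rows, last_row)
```

Reusing one preallocated buffer keeps memory use flat, and the final partial flush makes sure the trailing rows are not lost.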

There's an interesting pattern when comparing file size for Input vs Output files. For small files (<167MB), Input>Output size. For larger files, Output>Input size. I suspect 2 factors will help: 1) Add `expectedrows=` parameter (_'this will optimize the HDF5 B-Tree and amount of memory used'_), 2) add compression (use `filters=` parameter). If chunkshape isn't set, _'a sensible value is calculated based on the expectedrows parameter'_. This won't decrease the output file size, but will improve I/O performance. — kcw78, Aug 19 '20 at 22:22

score 1 · Accepted Answer · answered Aug 24 '20 at 15:51

I wrote the answer you are referencing. That is a simple example that "only" writes 1.5e6 rows. I didn't do anything to optimize performance for very large files. You are creating a very large file, but did not say how many rows (obviously way more than 10**6). Here are some suggestions based on comments in another thread.

I recommend looking at five areas (3 related to PyTables code, and 2 based on external utilities).

PyTables code suggestions:

1. Enable compression when you create the file (add the filters= parameter when you create the file). Start with tb.Filters(complevel=1), where tb is the usual import alias for tables.
2. Define the expectedrows= parameter in .create_earray() (per the PyTables docs, 'this will optimize the HDF5 B-Tree and amount of memory used'). The defaults are set in tables/parameters.py (look for EXPECTED_ROWS_EARRAY for EArrays, or EXPECTED_ROWS_TABLE for tables; the latter is only 10000 in my installation). I suggest you set this to a larger value if you are creating 10**6 (or more) rows.
3. There is a side benefit to setting expectedrows=. If you don't define chunkshape, 'a sensible value is calculated based on the expectedrows parameter'. Check the value used. A better chunkshape won't decrease the created file size, but will improve I/O performance.
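
A minimal sketch of these three suggestions applied to an EArray like the one in the question. The file name `earray_demo.h5`, the helper `create_compressed_earray`, the choice of zlib, and the value passed for `expectedrows` are assumptions for illustration; use your own row estimate.

```python
import os
import tempfile

import numpy as np
import tables


def create_compressed_earray(path, expected_rows, complevel=1):
    """Open a new HDF5 file and create a compressed, extendable (0, 4) EArray."""
    filters = tables.Filters(complevel=complevel, complib="zlib")
    h5 = tables.open_file(path, mode="w")
    earray = h5.create_earray(
        h5.root,
        "dataset_1",
        atom=tables.Float64Atom(),
        shape=(0, 4),
        filters=filters,             # suggestion 1: enable compression
        expectedrows=expected_rows,  # suggestion 2: size hint for chunking
    )
    return h5, earray


demo_path = os.path.join(tempfile.mkdtemp(), "earray_demo.h5")
h5, earray = create_compressed_earray(demo_path, expected_rows=10**6)
earray.append(np.zeros((1000, 4)))

# Suggestion 3: check the chunkshape PyTables derived from expectedrows.
n_rows = earray.nrows
chunkshape = earray.chunkshape
complevel_used = earray.filters.complevel
print("rows:", n_rows, "chunkshape:", chunkshape, "complevel:", complevel_used)
h5.close()
```

If the reported chunkshape looks too small or too large for your access pattern, you can pass chunkshape= explicitly instead of relying on the computed value.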

If you didn't use compression when you created the file, there are 2 methods to compress existing files:

External Utilities:

1. The PyTables utility ptrepack: run it against an HDF5 file to create a new file (useful to go from uncompressed to compressed, or vice-versa). It is delivered with PyTables, and runs on the command line.
2. The HDF5 utility h5repack: works similarly to ptrepack. It is delivered with the HDF5 installer from The HDF Group.
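
As a rough sketch, both utilities can also be driven from Python with subprocess. The file names `table.h5` and `table_compressed.h5` and the helper functions are hypothetical; the compression level and library are just starting points. The commands are only printed here, and `run_if_available` executes one only if the tool is on your PATH.

```python
import shutil
import subprocess


def ptrepack_cmd(src, dst, complevel=1, complib="zlib"):
    """Build a ptrepack command that copies src into a compressed dst."""
    return [
        "ptrepack",
        f"--complevel={complevel}",
        f"--complib={complib}",
        f"{src}:/",  # copy everything under the root group
        f"{dst}:/",
    ]


def h5repack_cmd(src, dst, gzip_level=1):
    """Build an h5repack command that applies a GZIP filter to all datasets."""
    return ["h5repack", "-f", f"GZIP={gzip_level}", src, dst]


def run_if_available(cmd):
    """Run cmd only if its executable is found on PATH; return whether it ran."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


pt_cmd = ptrepack_cmd("table.h5", "table_compressed.h5")
h5_cmd = h5repack_cmd("table.h5", "table_compressed.h5")
print(" ".join(pt_cmd))
print(" ".join(h5_cmd))
```

Either command writes a new file and leaves the original untouched, so you need enough free disk space for both copies while it runs.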

There are trade-offs with file compression: it reduces the file size, but increases access time (reduces I/O performance). I tend to keep files I open frequently uncompressed (for best I/O performance). Then, when done, I convert them to compressed format for long-term archiving. You can continue to work with them in compressed format (the API handles this cleanly).

Thanks! My script now includes the `filters` and `expectedrows` parameters. When you say "reduces I/O performance"...what are the implications? The file access time will be reduced? — nuki, Aug 28 '20 at 23:37
The file access time will increase. Reading a compressed file is slower than reading one that is not compressed. Additional computations are required to read the compressed data. (Basically you are "unzipping" the data on the fly). The time penalty depends on the size of the file and data you are reading. You can benchmark with your data. Create a little program to read some data, then run against compressed and uncompressed versions of the file. (Use the ptrepack utility to create a copy.) — kcw78, Aug 29 '20 at 15:06
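
A small benchmark along those lines might look like the sketch below. It is an illustration under assumptions: the data is synthetic and highly compressible, the file names are made up, and both files are written directly instead of via ptrepack. Timings and size ratios on real data will differ.

```python
import os
import tempfile
import time

import numpy as np
import tables


def write_earray(path, data, filters=None):
    """Write data to a new file as an EArray, optionally compressed."""
    with tables.open_file(path, "w") as h5:
        earray = h5.create_earray(
            h5.root,
            "dataset_1",
            atom=tables.Float64Atom(),
            shape=(0, data.shape[1]),
            filters=filters,
            expectedrows=data.shape[0],
        )
        earray.append(data)


def timed_read(path):
    """Read the whole dataset and return (data, elapsed seconds)."""
    start = time.perf_counter()
    with tables.open_file(path, "r") as h5:
        data = h5.root.dataset_1.read()
    return data, time.perf_counter() - start


tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.h5")
packed_path = os.path.join(tmp_dir, "packed.h5")

# Synthetic, repetitive data (200,000 x 4 float64); real data compresses less.
data = np.tile(np.arange(4, dtype=np.float64), (200_000, 1))

write_earray(plain_path, data)
write_earray(packed_path, data, filters=tables.Filters(complevel=1, complib="zlib"))

plain_data, plain_time = timed_read(plain_path)
packed_data, packed_time = timed_read(packed_path)
plain_size = os.path.getsize(plain_path)
packed_size = os.path.getsize(packed_path)

print(f"uncompressed: {plain_size} bytes, read in {plain_time:.4f} s")
print(f"compressed:   {packed_size} bytes, read in {packed_time:.4f} s")
```

Run it a few times and on a slice of your real data before drawing conclusions, since the operating system's file cache can dominate a single measurement.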
