I use multiprocessing to generate numerous really large PyTables (H5) files--large enough to cause memory issues if read in a single sweep. Each of these files is created with tb.create_table
to hold 3 columns with mixed datatypes--the first two columns are integers, the third holds floats (such as here). The total number of rows can differ from file to file.
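For reference, each input file is assumed to look roughly like this (the column names col1/col2/col3 and the example.h5 filename are hypothetical; PyTables accepts a NumPy structured dtype as the table description):

```python
import os
import numpy as np
import tables as tb

# hypothetical schema matching the description above:
# two integer columns and one float column
dtype = np.dtype([('col1', np.int64), ('col2', np.int64), ('col3', np.float64)])

os.makedirs('output', exist_ok=True)
with tb.open_file('output/example.h5', 'w') as h5:
    table = h5.create_table('/', 'dataset_1', description=dtype)
    # rows are appended as a structured array; row counts per file can differ
    table.append(np.zeros(100, dtype=dtype))
```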
I want to combine these H5 files into a single H5 file; each separate H5 has a dataset_1
that needs to be appended to a single dataset in the new H5 file.
I modified the answer given here. In my case, I read each file's dataset in chunks and append them to the combined H5 file. Is there a computationally faster (or cleaner) way to do this job?
The minimal working code and sample output are below; the H5 files are fetched from the output/
directory:
import os
import numpy as np
import tables as tb

# no. of rows to read per chunk
factor = 10**7

# gather files to combine
file_lst = []
for fl in os.listdir('output/'):
    if not fl.startswith('combined'):
        file_lst.append(fl)

# combined file name
file_cmb = tb.open_file('output/combined.h5', 'w')

# copy file-1 dataset to new file
file1 = tb.open_file(f'output/{file_lst[0]}', 'r')
z = file1.copy_node('/', name='dataset_1', newparent=file_cmb.root, newname='dataset_1')
print(f'File-0 shape: {file1.root.dataset_1.shape[0]}')

for file_idx in range(len(file_lst)):
    if file_idx > 0:
        file2 = tb.open_file(f'output/{file_lst[file_idx]}', 'r')
        file2_dset = file2.root.dataset_1
        shape = file2_dset.shape[0]
        print(f'File-{file_idx} shape: {shape}')
        # determine number of chunk loops to read entire file2
        if shape < factor:
            chunk_loop = 1
        else:
            chunk_loop = shape // factor
        size_int = shape // chunk_loop
        size_arr = np.repeat(size_int, chunk_loop)
        if shape % chunk_loop:
            last_size = shape % chunk_loop
            size_arr = np.append(size_arr, last_size)
            chunk_loop += 1
        chunk_start = 0
        chunk_end = 0
        for alpha in range(size_arr.shape[0]):
            chunk_end = chunk_end + size_arr[alpha]
            z.append(file2_dset[chunk_start:chunk_end])
            chunk_start = chunk_start + size_arr[alpha]
        file2.close()
        print(f'Combined file shape: {z.shape}')

file1.close()
file_cmb.close()
Sample output:
File-0 shape: 787552
File-1 shape: 56743654
File-2 shape: 56743654
File-3 shape: 56743654
Combined file shape: (171018514,)
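One simplification I considered: the size_arr bookkeeping can be replaced by a stepped range, since the final slice can simply stop at the row count and no remainder handling is needed. A minimal sketch of that chunk-boundary computation (the chunk_bounds helper is my own name, not a PyTables API):

```python
def chunk_bounds(nrows, step):
    """Yield (start, stop) row ranges covering nrows rows in steps of `step`;
    the final range is clamped to nrows, so no remainder handling is needed."""
    for start in range(0, nrows, step):
        yield start, min(start + step, nrows)

# applied to the combining loop above, this collapses the size_arr logic to:
#   for start, stop in chunk_bounds(file2_dset.shape[0], factor):
#       z.append(file2_dset[start:stop])
```

Since slicing a PyTables table with a stop at the row count reads exactly the remaining rows, the last partial chunk needs no special case.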