
I use multiprocessing to generate numerous very large PyTables (H5) files, large enough to cause memory issues if read in a single sweep. Each of these files is created with tb.create_table and has 3 columns with mixed datatypes: the first two columns hold integers, the third holds floats (such as here). The total number of rows can differ from file to file.
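
For context, a minimal sketch of how each input file might be written (the column names, file name and dummy data below are placeholders for illustration, not my actual schema):

import numpy as np
import tables as tb

# hypothetical schema: two integer columns and one float column, as described above
dtype = np.dtype([('col_a', np.int64), ('col_b', np.int64), ('col_c', np.float64)])

with tb.open_file('output/file_0.h5', 'w') as h5f:
    tbl = h5f.create_table('/', 'dataset_1', description=dtype)
    rows = np.zeros(1000, dtype=dtype)  # placeholder rows; real files are much larger
    tbl.append(rows)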

I want to combine these H5 files into a single H5 file; each of the separate H5s has a dataset_1 that needs to be appended to a single dataset in the new H5 file.

I modified the answer given here. In my case, I read/append each file/dataset in chunks to the combined H5 file. Is there a computationally faster (or cleaner) way to do this job?

The minimal working code and sample output are below, where I fetch the H5 files from the output/ directory:

import os
import numpy as np
import tables as tb

# no. of rows to read per chunk
factor = 10**7

# gather files to combine
file_lst = []
for fl in os.listdir('output/'):
    if not fl.startswith('combined'):
        file_lst.append(fl)

# combined file name
file_cmb = tb.open_file('output/combined.h5', 'w')
# copy file-1 dataset to new file
file1 = tb.open_file(f'output/{file_lst[0]}', 'r')
z = file1.copy_node('/', name='dataset_1', newparent=file_cmb.root, newname='dataset_1')
print(f'File-0 shape: {file1.root.dataset_1.shape[0]}')

for file_idx in range(len(file_lst)):
    if file_idx>0:
        file2 = tb.open_file(f'output/{file_lst[file_idx]}', 'r')
        file2_dset = file2.root.dataset_1
        shape = file2_dset.shape[0]
        print(f'File-{file_idx} shape: {shape}')

        # determine the number of chunk loops needed to read all of file2
        if shape<factor:
            chunk_loop = 1
        else:
            chunk_loop = shape//factor

        size_int = shape//chunk_loop
        size_arr = np.repeat(size_int,chunk_loop)

        if shape%chunk_loop:
            last_size = shape % chunk_loop
            size_arr = np.append(size_arr, last_size)
            chunk_loop += 1

        chunk_start = 0
        chunk_end = 0

        for alpha in range(size_arr.shape[0]):
            chunk_end = chunk_end + size_arr[alpha]
            z.append(file2_dset[chunk_start:chunk_end])
            chunk_start = chunk_start + size_arr[alpha]
        file2.close()

print(f'Combined file shape: {z.shape}')
file1.close()
file_cmb.close()

Sample output:

File-0 shape: 787552
File-1 shape: 56743654
File-2 shape: 56743654
File-3 shape: 56743654
Combined file shape: (171018514,)
nuki
  • The linked answer is a good approach (the author is the PyTables primary developer). If you use that approach, consider this: the first copied dataset inherits compression and chunking properties from the original. You may want to change them. Also, there are other ways to do this. I wrote an answer that demonstrates multiple methods with both PyTables and h5py. See this answer: [How can I combine multiple .h5 file?](https://stackoverflow.com/a/58293398/10462884) – kcw78 Jul 16 '21 at 21:21
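
A minimal sketch of what this comment suggests, reusing the file1/file_cmb handles from the question's code; the complevel/complib values are only illustrative, and copy_node() is assumed to forward the filters keyword on to Leaf.copy():

# override the inherited compression/chunking when copying the first dataset
filters = tb.Filters(complevel=5, complib='blosc')  # illustrative settings
z = file1.copy_node('/', name='dataset_1',
                    newparent=file_cmb.root, newname='dataset_1',
                    filters=filters)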

1 Answer


You have the right idea. I prefer context managers for file handling, and the logic to loop and make incremental copies was hard to follow (you also don't need the size arrays; the chunk bounds can be computed on the fly). I took a stab at refactoring. However, without the input files I couldn't debug, so there may be minor errors.

import os
import tables as tb

# no. of rows to read per chunk
factor = 10**7

# gather files to combine
file_lst = []
for fl in os.listdir('output/'):
    if not fl.startswith('combined'):
        file_lst.append(fl)

# combined file name
with tb.open_file('output/combined.h5', 'w') as file_cmb:
    for file_idx, filename in enumerate(file_lst):
        if file_idx == 0:
            # copy the first file's dataset to the new file
            with tb.open_file(f'output/{filename}', 'r') as file1:
                z = file1.copy_node('/', name='dataset_1', newparent=file_cmb.root, newname='dataset_1')
                print(f'File1-{filename} shape: {file1.root.dataset_1.shape[0]}')
        
        else:
            with tb.open_file(f'output/{filename}', 'r') as file2:
                file2_dset = file2.root.dataset_1
                shape = file2_dset.shape[0]
                print(f'File2-{filename} shape: {shape}')
        
                chunk_loops = shape//factor
                if shape > chunk_loops*factor:
                    chunk_loops += 1
                
                chunk_start, chunk_end = 0, 0
                for alpha in range(chunk_loops):                   
                    if chunk_start + factor > shape:
                        chunk_end = shape
                    else:
                        chunk_end = chunk_start + factor
                        
                    z.append(file2_dset[chunk_start:chunk_end])
                    chunk_start = chunk_end
                       
    print(f'Combined file shape: {z.shape}')
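
For what it's worth, the inner chunk loop above could also be written with a stepped range, which avoids computing chunk_loops at all; a sketch using the same shape, factor, file2_dset and z names:

# equivalent chunked append written with a stepped range;
# min() clips the final chunk to the end of the table
for chunk_start in range(0, shape, factor):
    chunk_end = min(chunk_start + factor, shape)
    z.append(file2_dset[chunk_start:chunk_end])
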
kcw78