
I have 8 large h5 files (~100 GB each), each with several different datasets (say 'x', 'y', 'z', 'h'). I'd like to merge the 'x' and 'y' datasets from all 8 files into a test.h5 and a train.h5 file. Is there a fast way to do this? In total I have 800,080 rows. I first create my train file with `train_file = h5py.File(os.path.join(base_path, 'data/train.h5'), 'w', libver='latest')` and, after calculating a random split, I create the datasets:

train_file.create_dataset('x', (num_train, 256, 256, 1))
train_file.create_dataset('y',(num_train,1))

[similarly for test_file]
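i.e., something like this (the data/test.h5 path is just illustrative):

    test_file = h5py.File(os.path.join(base_path, 'data/test.h5'), 'w', libver='latest')
    test_file.create_dataset('x', (num_test, 256, 256, 1))
    test_file.create_dataset('y', (num_test, 1))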

train_indeces = np.asarray([1]*num_train + [0]*num_test)
np.random.shuffle(train_indeces)

Then I iterate over each of my 8 files, writing each row to either the train or the test file:

    indeces_index = 0
    last_train_index = 0
    last_test_index = 0
    for e in files:
        print(f'FILE:  {e}')
        rnd_file = h5py.File(f'{base_path}data/{e}', 'r', libver='latest')

        for j in tqdm(range(rnd_file['x'].shape[0] )):
            if train_indeces[indeces_index]==1:
                train_file['x'][last_train_index] = rnd_file['x'][j]
                train_file['y'][last_train_index] = rnd_file['y'][j]
                last_train_index+=1
            else:
                test_file['x'][last_test_index] = rnd_file['x'][j]
                test_file['y'][last_test_index] = rnd_file['y'][j]
                last_test_index +=1

            indeces_index +=1
        rnd_file.close()

But by my calculations this would take ~12 days to run. Is there a (much) faster way to do this? Thanks in advance.

Ian Benlolo
  • If I understand, you are copying the contents of each dataset row by row using `for j in tqdm(range(rnd_file['x'].shape[0] ))`. Is that right? If so, that is the slowest way to read and write the data. I/O performance is dominated by the number of read/writes (and NOT by the size). I have several SO answers on this topic. See this link for the way to write an entire array: Methods 3a and 3b in [How can I combine multiple .h5 file?](https://stackoverflow.com/a/58223603/10462884). I/O performance is documented here: (https://stackoverflow.com/a/57963340/10462884) – kcw78 Apr 08 '21 at 19:35

1 Answer


If I understand your method, it performs 800,080 individual read/write operations. It's the large number of writes that is killing you. To improve performance, you have to reorder the I/O so that each operation reads or writes a large block of data. Done per file (as below), that is only 4 reads and 4 writes per input file, 32 of each in total.

Typically I would read an entire dataset into an array, then write it to the new file. Reading through your code, I see you use train_indeces to randomly select which rows go to train_file and which go to test_file. That complicates things "a little bit". :-)
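For reference, that simple whole-dataset copy is just one read and one write (a minimal sketch; the file names are made up):

    # read the whole 'x' dataset into memory, then write it out in one call
    with h5py.File('part1.h5', 'r') as fin, h5py.File('merged.h5', 'w') as fout:
        arr = fin['x'][:]                    # one read of the entire dataset
        fout.create_dataset('x', data=arr)   # one write of the entire array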

To replicate the randomness, I used np.where() to find the training and testing rows. Then I used NumPy "fancy indexing" to read those rows as an array (after converting the index array to a list), and wrote that array to the next open slot in the appropriate output dataset. (I reused your 3 counters, indeces_index, last_train_index, and last_test_index, to keep track of things.)

I think this will do what you want:
[Caveat: I'm 99% sure this will work, but it was not tested with real data.]

for e in files:
    print(f'FILE:  {e}')
    rnd_file = h5py.File(f'{base_path}data/{e}', 'r', libver='latest')
    
    rnd_size = rnd_file['x'].shape[0]
    # get an array with the next "rnd_size" indices
    ind_arr = train_indeces[indeces_index:indeces_index+rnd_size]

    # Get training data indices where index==1
    train_idx = np.where(ind_arr==1)[0]  # np.where() returns a tuple
    train_size = len(train_idx)
    
    x_train_arr = rnd_file['x'][train_idx.tolist()]
    train_file['x'][last_train_index:last_train_index+train_size] = x_train_arr
    
    y_train_arr = rnd_file['y'][train_idx.tolist()]
    train_file['y'][last_train_index:last_train_index+train_size] = y_train_arr
    
    # Get test data indices where index==0
    test_idx  = np.where(ind_arr==0)[0]   # np.where() returns a tuple
    test_size = len(test_idx)

    x_test_arr = rnd_file['x'][test_idx.tolist()]
    test_file['x'][last_test_index:last_test_index+test_size] = x_test_arr

    y_test_arr = rnd_file['y'][test_idx.tolist()]
    test_file['y'][last_test_index:last_test_index+test_size] = y_test_arr
    
    indeces_index   += rnd_size 
    last_train_index+= train_size
    last_test_index += test_size
  
    rnd_file.close()
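Note that this reuses the 3 counters from your code, so they still need to be initialized to zero before the file loop:

    indeces_index    = 0
    last_train_index = 0
    last_test_index  = 0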

You should also consider opening the input files with Python's with/as context manager. Use this:

with h5py.File(f'{base_path}data/{e}', 'r', libver='latest') as rnd_file:

You do not need `rnd_file.close()` with the context manager.

Instead of this:

rnd_file = h5py.File(f'{base_path}data/{e}', 'r', libver='latest')
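Applied to the loop above, the structure becomes (the per-file logic is unchanged):

    for e in files:
        print(f'FILE:  {e}')
        with h5py.File(f'{base_path}data/{e}', 'r', libver='latest') as rnd_file:
            ...  # same indexing and slice-copy code as above, indented one extra level
        # no rnd_file.close() needed; the file is closed when the with block exits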
kcw78
  • Thank you! Writing in batches did speed things up greatly! And thanks for the tip for opening the files with the `with/as` manager, will definitely use that more often in the future. – Ian Benlolo Apr 14 '21 at 18:58