
Here is an interesting problem.

Input: two N x 4 arrays, each sorted on column 2, stored as dataset_1 and dataset_2 in an HDF5 file (input.h5). N is huge (the data originally comes from a ~10 GB file, hence the HDF5 storage).

Output: for every column-2 element of dataset_1, subtract each column-2 element of dataset_2 whose difference (delta) lies within +/- 4000, and save this information to a dataset (dset) in a new HDF5 file. I need to go back and forth through this new file, hence HDF5 rather than a text file.

Concern: I initially used the .append method, but that crashed the execution on 10 GB inputs, so I am now using dset.resize (and would prefer to stick with it). I am also using binary search, as I was told in one of my previous posts. The script now works on large (10 GB) datasets, but it is quite slow; the subtraction (for/while) loop is the likely culprit. Any suggestions on how to make this fast? I am aiming for the fastest approach that is still simple, since I am a beginner.

import numpy as np
import time
import h5py
import sys
import csv

f_r = h5py.File('input.h5', 'r+')

dset1 = f_r.get('dataset_1')
dset2 = f_r.get('dataset_2')
r1,c1 = dset1.shape
r2,c2 = dset2.shape

left, right, count = 0,0,0
W = 4000  # Window half-width
n = 1  # rows appended to dset per resize call

# **********************************************
#   HDF5 Out Creation 
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)

for j in range(r1):
    e1 = dset1[j,1]

    # advance left pointer to the first dataset-2 row inside the -W edge of e1's window
    while left < r2 and dset2[left,1] - e1 <= -W:
        left += 1
    # advance right pointer to the first dataset-2 row past the +W edge of e1's window
    while right < r2 and dset2[right,1] - e1 <= W:
        right += 1

    for i in range(left, right):
        delta = e1 - dset2[i,1]
        dset.resize(dset.shape[0] + n, axis=0)
        dset[count, 0:4] = [count, dset1[j,1], dset2[i,1], delta]
        count += 1

print("\nFinal shape of dataset created: " + str(dset.shape))

f_w.close()
f_r.close()
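
For reference, the binary search mentioned above can locate the same window directly with np.searchsorted, and the whole window can then be appended with one resize and one slice assignment instead of one resize per row. This is only a rough sketch, under the assumptions that column 2 of dataset_2 is sorted ascending and that its column-2 values fit in memory; it reuses the names (e1, W, count, dset) from the script above, and the searchsorted bounds reproduce what the two pointer loops compute.

t2 = dset2[:, 1]   # column-2 of dataset-2, read into memory once, before the for j loop

# inside the for j loop, replacing the two while loops and the for i loop:
left = np.searchsorted(t2, e1 - W, side='right')    # first row with t2 > e1 - W
right = np.searchsorted(t2, e1 + W, side='right')   # first row with t2 > e1 + W
if right > left:
    # rows left..right-1 are the same window the pointer loops find
    block = np.column_stack([count + np.arange(right - left),  # running pair index
                             np.full(right - left, e1),        # col.2 value from dataset-1
                             t2[left:right],                   # col.2 values from dataset-2
                             e1 - t2[left:right]])             # delta
    start = dset.shape[0]
    dset.resize(start + len(block), axis=0)   # grow once per j, not once per row
    dset[start:start + len(block), :] = block
    count += len(block)

Growing dset once per j (or, better still, once per block of j values) keeps the number of HDF5 write calls small, which is usually where the time goes.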

EDIT (Aug 8): chunking the HDF5 file, as suggested by @kcw78. So, I tried chunking as well. The following works well for small files (< 100 MB), but the computation time increases enormously when I work with GBs of data. Can anything in my code be improved to make it fast?

My suspicion is that the `for j` loop is computationally expensive and may be the reason. Any suggestions?

filename = 'file.h5'
with h5py.File(filename, 'r') as fid:
    chunks1 = fid["dataset_1"][:, :]
    chunks2 = fid["dataset_2"][:, :]

print(chunks1.shape, chunks2.shape) # shape is (13900,4) and (138676,4)

count = 0
W = 4000  # Window half-width
# **********************************************
#   HDF5-Out Creation
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)

# chunk size to read from first/second dataset
size1 = 34850
size2 = 34669
# buffer "n" subtracted rows in memory and write them to dset in one call
n = 10**4
buf = np.zeros((n, 4))  # in-memory buffer of rows waiting to be written
u = 0                   # rows already written to dset
fill_index = 0          # rows currently sitting in buf

for c in range(4):  # read 4 chunks of dataset-1 one-by-one
    h = c * size1
    chunk1 = chunks1[h:(h + size1)]

    for d in range(4):  # read chunks of dataset-2
        g = d * size2
        chunk2 = chunks2[g:(g + size2)]
        r2 = chunk2.shape[0]
        left, right = 0, 0

        for j in range(chunk1.shape[0]):  # grab col.2 values from dataset-1
            e1 = chunk1[j, 1]
            # advance left pointer to the first chunk-2 row inside the -W edge of e1's window
            while left < r2 and chunk2[left, 1] - e1 <= -W:
                left += 1
            # advance right pointer to the first chunk-2 row past the +W edge of e1's window
            while right < r2 and chunk2[right, 1] - e1 <= W:
                right += 1

            for i in range(left, right):
                if chunk1[j, 0] < 8193 and chunk2[i, 0] < 8193:
                    e2 = chunk2[i, 1]
                    delta = e1 - e2  # subtract col.2 values
                    buf[fill_index] = [count, e1, e2, delta]  # buffer this row in memory
                    count += 1
                    fill_index += 1

                    if fill_index == n:  # write n buffered rows in one call
                        dset.resize(dset.shape[0] + n, axis=0)
                        dset[u:(u + n), 0:4] = buf
                        u += n
                        fill_index = 0
        del chunk2
    del chunk1
# flush any rows still sitting in the buffer
if fill_index:
    dset.resize(dset.shape[0] + fill_index, axis=0)
    dset[u:(u + fill_index), 0:4] = buf[:fill_index]
    u += fill_index
f_w.close()

print(count)  # number of subtracted values whose difference lies within +/- 4000
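
Since chunks1 and chunks2 are already fully in memory here, the j and i loops above can be replaced by NumPy calls: binary-search both window edges for every row of a dataset-1 chunk at once, expand the matches into index arrays, and write each chunk pair's rows with a single resize and slice assignment. Below is a minimal sketch under the same sortedness assumption (column 2 ascending in both chunks); window_pairs is an illustrative helper name, not part of my script, and the chunk[:, 0] < 8193 filter is left out for brevity (it could be applied to the chunks beforehand).

import numpy as np

def window_pairs(chunk1, chunk2, W=4000):
    """Return an (n_pairs, 4) array [pair_no, e1, e2, e1 - e2] for every pair whose
    column-2 difference falls in the same +/- W window as the pointer loops above."""
    t1 = chunk1[:, 1]
    t2 = chunk2[:, 1]
    if len(t1) == 0:
        return np.empty((0, 4))
    lo = np.searchsorted(t2, t1 - W, side='right')   # first row with t2 > e1 - W
    hi = np.searchsorted(t2, t1 + W, side='right')   # first row with t2 > e1 + W
    idx1 = np.repeat(np.arange(len(t1)), hi - lo)    # dataset-1 row of each pair
    idx2 = np.concatenate([np.arange(l, h) for l, h in zip(lo, hi)])  # dataset-2 rows
    delta = t1[idx1] - t2[idx2]
    return np.column_stack([np.arange(len(delta)), t1[idx1], t2[idx2], delta])

# usage per chunk pair (inside the c/d loops above), one resize + one write each:
#     rows = window_pairs(chunk1, chunk2)
#     start = dset.shape[0]
#     dset.resize(start + len(rows), axis=0)
#     dset[start:start + len(rows), :] = rows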

EDIT (Jul 31): I tried reading in chunks and even using memory mapping. It is efficient if I do not perform any subtraction and just iterate over the chunks. The `for j in range(m):` loop is the inefficient part, probably because I grab each value of the file-1 chunk one at a time. (This is while only subtracting, not saving the differences.) Is there better logic or a better implementation that could replace `for j in range(m):`?

import numpy as np
import pandas as pd

size1 = 1_000_000   # rows per chunk read from file-1
size2 = 1_000_000   # rows per chunk read from file-2
count, a, i, prog_count = 0, 0, 0, 0   # running counters used below
filename = ["file-1.txt", "file-2.txt"]
chunks1 = pd.read_csv(filename[0], chunksize=size1,
                      names=['c1', 'c2', 'lt', 'rt'])
fp1 = np.memmap('newfile1.dat', dtype='float64', mode='w+', shape=(size1, 4))
fp2 = np.memmap('newfile2.dat', dtype='float64', mode='w+', shape=(size2, 4))
 
for chunk1 in chunks1: # grab chunks from file-1
    m, _ = chunk1.shape  
    fp1[0:m,:] = chunk1
    chunks2 = pd.read_csv(filename[1], chunksize=size2,
                          names=['ch', 'tmstp', 'lt', 'rt'])
    for chunk2 in chunks2: # grab chunks from file-2
        k, _ = chunk2.shape  
        fp2[0:k, :] = chunk2
 
        for j in range(m): # grab values from file-1's chunk one at a time
            e1 = fp1[j, 1]
            delta_mat = e1 - fp2 # just a test; really e1 should be subtracted from column 2 of fp2, not the whole array
            count += 1
 
        fp2.flush()
        a += k
 
    fp1.flush()
    del chunks2
    i += m
    prog_count += m
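
The per-row for j in range(m): loop this edit asks about can also be pushed into NumPy by broadcasting one chunk's column-2 values against the other's and masking the +/- 4000 window in one shot. A rough sketch using the fp1/fp2 memmaps and the chunk sizes m and k from above; note the intermediate matrix has shape (m, k), so the chunk sizes would have to be far smaller than 1,000,000 rows for it to fit in RAM:

W = 4000
col1 = fp1[:m, 1]                                # column-2 of the file-1 chunk
col2 = fp2[:k, 1]                                # column-2 of the file-2 chunk
delta_mat = col1[:, None] - col2[None, :]        # all pairwise differences, shape (m, k)
in_window = (delta_mat >= -W) & (delta_mat < W)  # same bounds as the pointer loops in the scripts above
count += int(in_window.sum())                    # how many pairs fall in the window
deltas = delta_mat[in_window]                    # the in-window differences, as a 1-D array
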
nuki
  • You are reading and writing data from `f_r` and `f_w` row-by-row. Right? I/O performance is a function of the # of reads/writes (not the size of the dataset). See this answer: [SO_57953554](https://stackoverflow.com/a/57963340/10462884) I suggest you read a "large" number of rows at one time into temporary arrays (say, 1e4 to 1e6). Then do your calculations and write the new data into `f_w`. This will significantly reduce the number of read/write calls and improve performance. However, you have to track 2 sets of row counters, so it's a little more complicated. – kcw78 Jul 31 '20 at 14:39
  • @kcw78: Tried something based on your suggestion, please read my EDIT in the question above. – nuki Jul 31 '20 at 19:09
  • My comments were about reading a range of rows from `input.h5`, doing your calculations, then writing those values to `data.h5`. I'm not familiar with Pandas or memmap, so can't help with your new code. – kcw78 Jul 31 '20 at 20:54
  • @kcw78: Hi, so I tried that as well. Please see my edit above. It works for large files but it is quite slow. Can something be improvised? – nuki Aug 11 '20 at 02:44
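
Following kcw78's suggestion in the comments (read a large range of rows per h5py call, do the calculations in memory, then write one block of results per call), the first script could be reorganised roughly as below. This is only a sketch: BLOCK is a hypothetical block size to tune against available memory, it assumes the whole of dataset_2 fits in memory (otherwise it needs blocking too), and it reuses the illustrative window_pairs helper sketched after the Aug 8 edit.

import numpy as np
import h5py

BLOCK = 100_000   # rows of dataset_1 read per h5py call (hypothetical, tune to RAM)
W = 4000

with h5py.File('input.h5', 'r') as f_r, h5py.File('data.h5', 'w') as f_w:
    dset1 = f_r['dataset_1']
    dset2 = f_r['dataset_2'][:, :]               # read once; assumed to fit in memory
    out = f_w.create_dataset('dataset_1', shape=(0, 4), maxshape=(None, 4),
                             dtype='f8', chunks=True)
    written = 0
    for start in range(0, dset1.shape[0], BLOCK):
        block1 = dset1[start:start + BLOCK, :]       # one read call per block
        rows = window_pairs(block1, dset2, W)        # illustrative helper from the sketch above
        if len(rows):
            rows[:, 0] += written                    # keep the pair counter running across blocks
            out.resize(written + len(rows), axis=0)
            out[written:written + len(rows), :] = rows   # one write call per block
            written += len(rows)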
