Here is an interesting problem.

Input: two arrays (N x 4, sorted on column 2) stored as dataset_1 and dataset_2 in an HDF5 file (input.h5). N is huge (the original input is about 10 GB, hence the HDF5 storage).

Output: subtract each column-2 element of dataset_2 from each column-2 element of dataset_1, keeping only the pairs whose difference (delta) lies within +/-4000, and save this information in dset of a new HDF5 file. I need to refer back and forth to this new file, hence HDF5 rather than a text file.

Concern: I initially used the .append method, but it crashed the execution for the 10 GB input. So I am now using the dset.resize method (and would prefer to stick with it). I am also using binary search, as suggested in one of my earlier posts. Although the script now seems to work for large (10 GB) datasets, it is quite slow. The subtraction (for/while) loop is probably the culprit. Any suggestions on how I can make this fast? I am aiming for the fastest approach (and preferably the simplest, since I am a beginner).
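For reference, this is the kind of binary-search lookup on the sorted column that I understand was suggested. It is only a minimal sketch with made-up numbers; np.searchsorted should give the window bounds because column 2 is sorted, though the handling of values exactly at +/-W may need adjusting.

import numpy as np

# Minimal sketch with made-up numbers: find, for one value e1 from dataset-1,
# the slice of dataset-2's sorted column 2 whose values lie within +/- W of e1.
col2 = np.array([10.0, 3500.0, 7200.0, 12000.0])   # sorted column-2 values of dataset-2
e1 = 5000.0
W = 4000

# side='right' mirrors the "<=" comparisons in my while loops
left = np.searchsorted(col2, e1 - W, side='right')   # first index with col2 > e1 - W
right = np.searchsorted(col2, e1 + W, side='right')  # first index with col2 > e1 + W

deltas = e1 - col2[left:right]   # all differences in the +/- W window at once
print(left, right, deltas)       # -> 1 3 [ 1500. -2200.]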
import numpy as np
import time
import h5py
import sys
import csv

f_r = h5py.File('input.h5', 'r+')
dset1 = f_r.get('dataset_1')
dset2 = f_r.get('dataset_2')
r1, c1 = dset1.shape
r2, c2 = dset2.shape
left, right, count = 0, 0, 0
W = 4000  # Window half-width
n = 1

# **********************************************
# HDF5 Out Creation
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)

for j in range(r1):
    e1 = dset1[j, 1]
    # move left pointer so that it is within -W of e1
    while left < r2 and dset2[left, 1] - e1 <= -W:
        left += 1
    # move right pointer so that it is just outside of +W
    while right < r2 and dset2[right, 1] - e1 <= W:
        right += 1
    for i in range(left, right):
        delta = e1 - dset2[i, 1]
        dset.resize(dset.shape[0] + n, axis=0)
        dset[count, 0:4] = [count, dset1[j, 1], dset2[i, 1], delta]
        count += 1

print("\nFinal shape of dataset created: " + str(dset.shape))
f_w.close()
f_r.close()
EDIT (Aug 8, chunked reads as suggested by @kcw78)

@kcw78: I tried reading in chunks as well. The following works well for small files (<100 MB), but the computation time increases enormously when I work with GBs of data. Can something in my code be improved to make it fast? My suspicion is that the for j loop is computationally expensive and may be the reason. Any suggestions?
import numpy as np
import h5py

filename = 'file.h5'
with h5py.File(filename, 'r') as fid:
    chunks1 = fid["dataset_1"][:, :]
with h5py.File(filename, 'r') as fid:
    chunks2 = fid["dataset_2"][:, :]
print(chunks1.shape, chunks2.shape)  # shape is (13900,4) and (138676,4)

count = 0
W = 4000  # Window half-width
# **********************************************
# HDF5-Out Creation
# **********************************************
f_w = h5py.File('data.h5', 'w')
d1 = np.zeros(shape=(0, 4))
dset = f_w.create_dataset('dataset_1', data=d1, maxshape=(None, None), chunks=True)

# chunk size to read from first/second dataset
size1 = 34850
size2 = 34669
# write "n" subtracted values to dset at a time
n = 10**4
u = 0
fill_index = 0
buf = np.zeros((n, 4))  # buffer of output rows, flushed in blocks of n

for c in range(4):  # read 4 chunks of dataset-1 one-by-one
    h = c * size1
    chunk1 = chunks1[h:(h + size1)]
    for d in range(4):  # read chunks of dataset-2
        g = d * size2
        chunk2 = chunks2[g:(g + size2)]
        r2 = chunk2.shape[0]
        left, right = 0, 0
        for j in range(chunk1.shape[0]):  # grab col.2 values from dataset-1
            e1 = chunk1[j, 1]
            # move left pointer so that it is within -W of e1
            while left < r2 and chunk2[left, 1] - e1 <= -W:
                left += 1
            # move right pointer so that it is just outside of +W
            while right < r2 and chunk2[right, 1] - e1 <= W:
                right += 1
            for i in range(left, right):
                if chunk1[j, 0] < 8193 and chunk2[i, 0] < 8193:
                    e2 = chunk2[i, 1]
                    delta = e1 - e2  # subtract col.2 values
                    buf[fill_index] = [count, e1, e2, delta]
                    count += 1
                    fill_index += 1
                    if fill_index == n:  # flush the buffer in blocks of n rows
                        dset.resize(dset.shape[0] + n, axis=0)
                        dset[u:(u + n), 0:4] = buf
                        u += n
                        fill_index = 0
        del chunk2
    del chunk1

# flush any rows still left in the buffer
if fill_index > 0:
    dset.resize(dset.shape[0] + fill_index, axis=0)
    dset[u:(u + fill_index), 0:4] = buf[:fill_index]

f_w.close()
print(count)  # no. of subtracted values whose difference is within +/- 4000
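For clarity, this is the kind of change I have been wondering about for the inner for i loop above: building all output rows of one +/-W window in a single NumPy operation instead of element by element. The function name and the made-up numbers are only illustrative, not part of my current script.

import numpy as np

def window_rows(e1, col2_2, left, right, start_count):
    """Build all output rows for one e1 and its +/- W window in one shot
    (sketch only; names and layout are illustrative)."""
    window = col2_2[left:right]      # col-2 values of dataset-2 inside the window
    deltas = e1 - window             # all differences at once
    rows = np.column_stack((
        np.arange(start_count, start_count + window.size),  # running row index
        np.full(window.size, e1),                            # col-2 value from dataset-1
        window,                                              # col-2 value from dataset-2
        deltas,
    ))
    return rows

# tiny usage example with made-up numbers
col2_2 = np.array([10.0, 3500.0, 7200.0, 12000.0])
rows = window_rows(5000.0, col2_2, 1, 3, 0)
print(rows)  # two rows: [0, 5000, 3500, 1500] and [1, 5000, 7200, -2200]

Blocks built this way could then be written to dset with one resize per block rather than one resize (or one Python-level assignment) per element, which is what I suspect is the main cost.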
EDIT (Jul 31)

I tried reading in chunks and even using memory mapping. This is efficient as long as I do not perform any subtraction and just iterate over the chunks. The for j in range(m): loop is the inefficient part, probably because I am grabbing each value of the file-1 chunk one at a time. And this is when I am only subtracting, not even saving the differences. Can you think of better logic/implementation that could replace the for j in range(m): loop?
import numpy as np
import pandas as pd

size1 = 1_000_000
size2 = 1_000_000
filename = ["file-1.txt", "file-2.txt"]
count, a, i, prog_count = 0, 0, 0, 0

chunks1 = pd.read_csv(filename[0], chunksize=size1,
                      names=['c1', 'c2', 'lt', 'rt'])
fp1 = np.memmap('newfile1.dat', dtype='float64', mode='w+', shape=(size1, 4))
fp2 = np.memmap('newfile2.dat', dtype='float64', mode='w+', shape=(size2, 4))

for chunk1 in chunks1:  # grab chunks from file-1
    m, _ = chunk1.shape
    fp1[0:m, :] = chunk1
    chunks2 = pd.read_csv(filename[1], chunksize=size2,
                          names=['ch', 'tmstp', 'lt', 'rt'])
    for chunk2 in chunks2:  # grab chunks from file-2
        k, _ = chunk2.shape
        fp2[0:k, :] = chunk2
        for j in range(m):  # grab values from file-1's chunk
            e1 = fp1[j, 1]
            delta_mat = e1 - fp2  # just a test; actually e1 should be subtracted from col-2 of fp2, not the whole fp2
            count += 1
        fp2.flush()
        a += k
    fp1.flush()
    del chunks2
    i += m
    prog_count += m
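For context, this is roughly the kind of vectorized replacement for the for j in range(m): loop that I have been considering: computing the +/-W window bounds for a whole chunk at once with np.searchsorted instead of walking each row in Python. It is an untested sketch with made-up numbers and assumes column 2 is sorted in both files.

import numpy as np

W = 4000
col2_1 = np.array([1000.0, 5000.0, 9000.0])         # col-2 of a file-1 chunk (sorted)
col2_2 = np.array([10.0, 3500.0, 7200.0, 12000.0])  # col-2 of a file-2 chunk (sorted)

# window bounds for every file-1 row in two vectorized calls
lefts = np.searchsorted(col2_2, col2_1 - W, side='right')   # window start per file-1 row
rights = np.searchsorted(col2_2, col2_1 + W, side='right')  # window end per file-1 row

counts = rights - lefts          # number of matches per file-1 row
print(lefts, rights, counts)     # -> [0 1 2] [2 3 4] [2 2 2]

# the per-row windows can then be expanded into output rows, e.g.:
for e1, lo, hi in zip(col2_1, lefts, rights):
    deltas = e1 - col2_2[lo:hi]  # all differences for this file-1 row at once
    print(e1, deltas)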