The data set I am using is too large to fit into memory, so I do the computations in batches and save the results back to file as I go.
The problem is that the last batch is not written to my h5py file, almost certainly because its size differs from all the previous batches. Is there any way I can make the chunks
more flexible?
Consider the following MWE:
import h5py
import numpy as np
import pandas as pd
from more_itertools import chunked
df = pd.DataFrame({'data': np.random.random(size=113)})
chunk_size = 10
index_chunks = chunked(df.index, chunk_size)
with h5py.File('SO.h5', 'w') as f:
    # preallocate the full-length dataset, resizable along the first axis
    dset = f.create_dataset('test', shape=(len(df), ), maxshape=(None, ), chunks=True, dtype=np.float32)
    for step, i in enumerate(index_chunks):
        temp_df = df.iloc[i]
        dset = f['test']
        # offset of the current batch, derived from the current chunk length
        start = step*len(i)
        dset[start:start+len(i)] = temp_df['data']
        dset.attrs['last_index'] = (step+1)*len(i)
# check data
with h5py.File('SO.h5', 'r') as f:
    print('last entry:', f['test'][-10::])  # yields 3 empty values because it did not match the usual batch size
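For reference, this is the kind of bookkeeping I imagine might be needed instead, keeping a running offset rather than multiplying the step by the current chunk length, but I don't know whether this is the idiomatic way to handle an uneven final batch with h5py (the file and dataset names are just the placeholders from the MWE above):
import h5py
import numpy as np
import pandas as pd
from more_itertools import chunked
df = pd.DataFrame({'data': np.random.random(size=113)})
with h5py.File('SO.h5', 'w') as f:
    dset = f.create_dataset('test', shape=(len(df), ), maxshape=(None, ), chunks=True, dtype=np.float32)
    start = 0  # running offset into the dataset
    for i in chunked(df.index, 10):
        batch = df.loc[i, 'data'].to_numpy(dtype=np.float32)
        # write exactly len(batch) values, so the short final batch is written too
        dset[start:start + len(batch)] = batch
        start += len(batch)
        dset.attrs['last_index'] = start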