
The data set that I am using is too large to fit into memory for the computations I need to do. To circumvent this, I am doing the computations in batches and saving the results to file as I go.

The problem I have is that my last batch does not get saved to my h5py file, almost certainly because its size differs from that of all the previous batches. Is there any way I can make the chunks more flexible?

Consider the following MWE:

import h5py
import numpy as np
import pandas as pd
from more_tools import chunked

df = pd.DataFrame({'data': np.random.random(size=113)})
chunk_size = 10
index_chunks = chunked(df.index, chunk_size)

with h5py.File('SO.h5', 'w') as f:
    dset = f.create_dataset('test', shape=(len(df), ), maxshape=(None, ), chunks=True, dtype=np.float32)

    for step, i in enumerate(index_chunks):
        temp_df = df.iloc[i]
        dset = f['test']
        start = step*len(i)
        dset[start:start+len(i)] = temp_df['data']
        dset.attrs['last_index'] = (step+1)*len(i)
# check data
with h5py.File('SO.h5', 'r') as f:
    print('last entry:', f['test'][-10::])  # yields 3 empty values because it did not match the usual batch size
John Stud
  • I don't have `more_tools` installed. What does `chunked` return exactly? – Mad Physicist Feb 11 '21 at 17:08
  • It splits the list of indices into n chunks to iterate over. It returns n lists. The package is from `itertools`. – John Stud Feb 11 '21 at 17:19
  • Chunking is used for initial dataset allocation and improves I/O performance for large files. You DO NOT have to write (or read) data to a dataset in a "chunked shape" (although that's what happens under the covers). You sized the dataset as `len(df)` and don't resize. Is this value larger than the sum of the batch sizes? (I calculate it as `sum(step*len(i))`.) If so, you should be fine. If not, that's a problem. When doing incremental writes to a resizable dataset, I recommend error checking to avoid writing past the end or overwriting existing data. – kcw78 Feb 11 '21 at 17:24
  • `start = step*len(i)` is incorrect for the last chunk. It should be fixed at `start = step*10` – Mad Physicist Feb 11 '21 at 17:32
  • Better yet, `start = i[0]` – Mad Physicist Feb 11 '21 at 17:34
  • You (as the programmer) **have** to get the dset indices correct here: `dset[start:start+len(i)]`. I suggest getting the shape of `temp_df['data']` and using it together with the previous value of the 'last_index' attribute to define your start/stop indices. Without knowing the values of the variables it's hard to do more diagnosis. – kcw78 Feb 11 '21 at 17:39
  • Thanks so much -- these comments have solved the issue! – John Stud Feb 11 '21 at 17:41
  • Take a look at this answer. It shows how to manage writing incremental, variable sized data blocks to a dataset. [Creating a dataset from multiple hdf5 groups](https://stackoverflow.com/a/66140390/10462884) – kcw78 Feb 11 '21 at 17:42
  • I've added an answer to help compute the right index. – Mad Physicist Feb 11 '21 at 17:44
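
As a side note, here is a minimal sketch of the incremental-write pattern kcw78 outlines in the comments above: a resizable dataset plus a `last_index` attribute used to decide where the next batch goes and to grow the dataset only when needed. The file name and helper function are illustrative, not part of the question:

import h5py
import numpy as np

def append_batch(dset, batch):
    # write `batch` at the next free position, growing the dataset if needed
    start = dset.attrs.get('last_index', 0)   # next free slot; 0 on the first call
    stop = start + len(batch)
    if stop > dset.shape[0]:                  # resize only when the batch would run past the end
        dset.resize((stop,))
    dset[start:stop] = batch
    dset.attrs['last_index'] = stop           # record where the next batch should start

with h5py.File('resizable_demo.h5', 'w') as f:
    dset = f.create_dataset('test', shape=(0,), maxshape=(None,),
                            chunks=True, dtype=np.float32)
    for batch in (np.random.random(10), np.random.random(3)):  # uneven batch sizes are fine
        append_batch(dset, batch)
    print(dset.shape, dset.attrs['last_index'])  # (13,) 13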

1 Answer


Your indexing is wrong. `step, i` goes like this:

 0,   0 ...   9
 1,  10 ...  19
 2,  20 ...  29
...
 9,  90 ...  99
10, 100 ... 109
11, 110 ... 112

For `step == 11`, `len(i) == 3`. That makes `start = step * len(i)` into `11 * 3 == 33`, while you're expecting `11 * 10 == 110`. You're simply writing to the wrong location. If you inspect the data in the fourth chunk, you will likely find that the fourth, fifth, and sixth elements are overwritten by the missing data.
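
To see this concretely, here is a small check (assuming the same 113-row frame and chunk size of 10, and that the question's `more_tools` refers to the `more_itertools` package) that prints the start index the original code computes next to the one it actually needs:

import numpy as np
import pandas as pd
from more_itertools import chunked

df = pd.DataFrame({'data': np.random.random(size=113)})
for step, i in enumerate(chunked(df.index, 10)):
    # buggy start vs. correct start: identical until the last, shorter chunk
    print(step, step * len(i), i[0])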

Here is a possible workaround (the loop below replaces the one inside the question's `with h5py.File('SO.h5', 'w') as f:` block):

last = 0
for step, i in enumerate(index_chunks):
    temp_df = df.iloc[i]
    dset = f['test']
    first = last
    last = first + len(i)
    dset[first:last] = temp_df['data']
    dset.attrs['last_index'] = last
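
With that change, repeating the question's read-back check should show the final three values written and the `last_index` attribute equal to 113 (reusing the SO.h5 file from the question):

with h5py.File('SO.h5', 'r') as f:
    dset = f['test']
    print('last_index:', dset.attrs['last_index'])  # 113
    print('last entry:', dset[-10::])               # the final 3 values are no longer empty

Equivalently, as suggested in the comments, `start = i[0]` also works, because each chunk's first index is already the absolute position at which that batch should be written.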
Mad Physicist