
I want to append new data to my already created HDF5 file, but I don't know how to append more data to it; I don't know the actual syntax for appending.

I have created an HDF5 file to save my data in HDF format as follows:

with h5py.File(save_path + 'PIC200829_256x256x256x3_fast_sj1.hdf5', 'w') as db:
    db.create_dataset('Predframes', data=trainX)
    db.create_dataset('GdSignal', data=trainY)
# this creates an HDF5 file with the given name
# and saves the data in the given format

What I want is to append more data (of the same kind) to it in the next iteration, instead of overwriting it and creating a new HDF5 file. One thing I know is that I will change "w" to "a", but I don't know what I need to write to append instead of create.

Since db.append('Predframes', data=trainX) is not the right syntax, what should I write instead of db.create_dataset('Predframes', data=trainX) to append rather than create?

The shape of trainX is (2500, 100, 100, 40), so when the next trainX with the same shape (2500, 100, 100, 40) is appended to the first one, its shape should be (5000, 100, 100, 40). Similarly, the shape of trainY is (2500, 80); after appending it should be (5000, 80).

Saran Zeb
  • Does this answer your question? [Incremental writes to hdf5 with h5py](https://stackoverflow.com/questions/25655588/incremental-writes-to-hdf5-with-h5py) – Homer512 Dec 27 '22 at 09:10
  • Thank you for your response, but the answer provided there is not appending the new data to the old; it is expanding and resizing the already created dataset. – Saran Zeb Dec 27 '22 at 12:29
  • That's what you do. You create a dataset that can be resized. Then you repeatedly do that. – Homer512 Dec 27 '22 at 15:11
  • Thanks, but in my case I don't actually need to resize them. I have already created the HDF5 file and saved data in it; now I need to append more data to it. It's very simple: I am just saving my data in an HDF5 file, but I can't do it all at once because of memory constraints, so I need to do it in several steps. The first time, I saved the data in the HDF5 file; if I now do the same for the next step, it will overwrite the already saved data, which I don't want. I want to append the data of the next step to the already saved data. – Saran Zeb Dec 27 '22 at 16:41
  • 1. Create the file. 2. Create the dataset with ```maxshape``` set to None in the dimension that you want to append data, as shown in the other answer. 3. Write the initial data. 4. Open the file with ```mode='a'```. 5. Open the existing dataset. 6. Resize the dataset to add one more row (or whatever), as shown in the other answer. 7. Write the new data to the new location. – Homer512 Dec 27 '22 at 16:47
  • Sorry, that answer is different from my case. I don't need to create more rows, nor do I need to resize; I have already created the shape. I just want to append extra data. – Saran Zeb Dec 27 '22 at 19:16
  • What is the shape of the dataset before appending, what is the intended shape after appending? – Homer512 Dec 27 '22 at 20:23
  • Sir, the shape is the same before and after appending. I have already explained in my question why I am appending: my RAM can't hold all the data I am processing, so I need to do it in many steps. I prepared my data and saved it to an HDF5 file with the size explained in the question; now I have processed some more data that I have to append to the file created in the first step, but I don't know how to append it. – Saran Zeb Dec 28 '22 at 18:38
  • Conceptually, HDF5 datasets act like numpy arrays. You created your dataset with the ```data=trainX``` parameter. That means the dataset has the exact same shape and content as the ```trainX``` array. Now you're saying you have more data to append. Clearly, that data has not been part of the ```trainX``` array. So, again, where do you want the data to go? – Homer512 Dec 28 '22 at 19:58
  • Exactly. I am saving the NumPy arrays, ```data=trainX``` (consisting of many NumPy arrays), so the contents of trainX are saved in the HDF5 file. Now, in the next step, I processed some more data, also NumPy arrays named trainX, which I want to append to the arrays from the first step. – Saran Zeb Dec 29 '22 at 17:18
  • If you append data, the size of the dataset changes, right? So to do that, you resize the dataset, then put the new data into the newly allocated locations. I don't know how many more words I can use to describe the exact same thing. – Homer512 Dec 29 '22 at 17:42
  • Hmm, can you write the code for that? How can I implement it in my case? – Saran Zeb Dec 29 '22 at 17:57
  • Well, what do you want the shape to be? What is the array shape of ```trainX```, how do you want it to be after appending? You said you do this because you don't have enough memory for the whole training set. How would it look if you had the memory? Like ```np.concatenate((batch1, batch2))``` or like ```np.stack((batch1, batch2))```? Or something else? – Homer512 Dec 29 '22 at 20:12
  • The shape of trainX is (2500, 100, 100, 40), so when the next trainX with the same shape (2500, 100, 100, 40) is appended to the first one, its shape should be (5000, 100, 100, 40); the shape of trainY is (2500, 80), and after appending it should be (5000, 80). – Saran Zeb Dec 30 '22 at 18:20

1 Answer


Here is the required code. When the dataset is first created, it has to declare that its outermost dimension is resizable (via maxshape).

from os import path

import h5py
import numpy as np

def create_for_append(h5file, name, data):
    # None in axis 0 of maxshape marks the outermost dimension as resizable
    data = np.asanyarray(data)
    return h5file.create_dataset(
        name, data=data, maxshape=(None,) + data.shape[1:])


filepath = path.join(save_path, 'PIC200829_256x256x256x3_fast_sj1.hdf5')
with h5py.File(filepath, 'w') as db:
    create_for_append(db, 'Predframes', trainX)
    create_for_append(db, 'GdSignal', trainY)

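Note that resizable datasets must use chunked storage; when maxshape is given, h5py picks a chunk shape automatically. If you know the size of each appended batch, you can optionally pass an explicit chunks argument (a sketch; the chunk values below are illustrative, not a requirement):

# optional: chunk by single rows along the resizable axis (illustrative values)
h5file.create_dataset(
    name, data=data,
    maxshape=(None,) + data.shape[1:],
    chunks=(1,) + data.shape[1:])
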
Then we can append the new data by resizing the dataset and putting the new data in the newly allocated range.

def append_to_dataset(dataset, data):
    data = np.asanyarray(data)
    # grow the resizable first axis by the number of new rows
    dataset.resize(len(dataset) + len(data), axis=0)
    # write the new data into the newly allocated slots at the end
    dataset[-len(data):] = data


with h5py.File(filepath, 'a') as db:
    append_to_dataset(db['Predframes'], trainX)
    append_to_dataset(db['GdSignal'], trainY)
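
With the shapes from the question, each call appends one batch: Predframes grows from (2500, 100, 100, 40) to (5000, 100, 100, 40), and GdSignal from (2500, 80) to (5000, 80). A quick check (a sketch, just reading the file back):

with h5py.File(filepath, 'r') as db:
    print(db['Predframes'].shape)  # (5000, 100, 100, 40) after one append
    print(db['GdSignal'].shape)    # (5000, 80)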
Homer512