
I want to reshape my h5py dataset the way I can with numpy.reshape(). The following code only works if I call numpy.array() at the beginning, but that only works with a small dataset and blows up my memory on a bigger one.

import h5py
import numpy as np

#load data
h5py_data_path = r'any\path\to\h5pyData\training.data.h5'  # raw string so the backslashes are not treated as escapes
t_data = h5py.File(h5py_data_path, 'r')
training_data = t_data['training.data']
######################################
#### Don't want to have this (blows up my memory) ####
training_data = np.array(training_data)
######################################

print('training_data    ',training_data.shape)
#out: training_data     (10203, 5, 341)

#reshape data
######################################
#### This works, but only with the numpy.array() call above ####
training_data = training_data.reshape(training_data.shape[0], 1, 5, 341)
######################################

print('training_data    ',training_data.shape)
#out: training_data     (10203, 1, 5, 341)

Is there a native way in h5py to reshape the dataset, or any other approach that works without loading everything into memory?

H. Senkaya
  • What part of the `h5py` docs don't you understand? – hpaulj May 17 '19 at 17:03
  • `training_data[0:n]` loads a slice of the dataset into memory. – hpaulj May 17 '19 at 17:05
  • The HDF5 file is an on-disk data structure. As such, there isn't a native `.reshape()` method. However, there is a `.resize()` method to extend an existing dataset. When you access a dataset as a numpy array, you are getting an in-memory view of the on-disk data. In your example, you are adding a dimension to the array (from (10203, 5, 341) to (10203, 1, 5, 341)). What is your intent? If you really need to reshape your training data, you can read the dataset, reshape it and write it to a new dataset (see the sketch after these comments). The new dataset can go in the current file, or a new one. – kcw78 May 17 '19 at 18:57
  • `training_data = t_data['training_data'].value.reshape(shape values)` worked but still had the memory issue. It seems it's more sensible to re-prepare the data in its new shape... – H. Senkaya May 17 '19 at 19:24
  • Yeah, that's a large array to reshape. Note `.value` is deprecated in h5py. The recommended method is now numpy-style array slicing: `training_data = t_data['training_data'][:]` (to access the entire array). It will accept the `.reshape()` method. – kcw78 May 17 '19 at 19:53
  • Also, check out this SO Q&A: [numpy-reshape-memory-error](https://stackoverflow.com/questions/48958509) for a very similar discussion. It might be useful to your situation. – kcw78 May 17 '19 at 19:54
  • Your deleted answer suggests that loading a dataset with `np.array(dataset)` uses more memory than `dataset.value` or `dataset[:]`. `np.array(dataset)` may be loading as `value` does, then making a copy of it (and then deleting the original load). `np.asarray(dataset)` would have been better, but still not needed. – hpaulj May 18 '19 at 04:05
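A minimal sketch of the read-reshape-write approach kcw78 describes above. The chunk size and the name of the new dataset are assumptions for illustration; the path is the placeholder from the question.

import h5py

# Placeholder path and dataset name taken from the question.
h5py_data_path = r'any\path\to\h5pyData\training.data.h5'
chunk_rows = 1000  # rows held in memory at a time (tune to your RAM)

with h5py.File(h5py_data_path, 'r+') as f:   # 'r+' so the new dataset can be written
    src = f['training.data']                 # shape (10203, 5, 341)
    n = src.shape[0]
    # The rank is fixed at creation, so the extra axis must be part of the new dataset.
    dst = f.create_dataset('training.data.reshaped',
                           shape=(n, 1) + src.shape[1:],
                           dtype=src.dtype)
    # Copy slice by slice: only `chunk_rows` rows are in memory at once.
    for start in range(0, n, chunk_rows):
        stop = min(start + chunk_rows, n)
        block = src[start:stop]              # this slice is read into memory
        dst[start:stop] = block.reshape(stop - start, 1, *src.shape[1:])

Since the new axis has length 1, an alternative is to skip rewriting the file entirely and reshape each slice right after reading it, e.g. `src[start:stop].reshape(-1, 1, 5, 341)` inside the training loop.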

1 Answer


While reshaping would be a nice feature to have, the h5py documentation is explicit: the dataset rank (number of dimensions) is fixed when the dataset is created.
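To make that concrete, here is a small illustration (the file and dataset names are placeholders, not from the question): `.resize()` can grow axes that were declared resizable via `maxshape`, but it cannot add or remove axes, so any extra dimension has to be part of the shape at creation time.

import h5py
import numpy as np

with h5py.File('example.h5', 'w') as f:
    # The rank (4 axes here) is fixed at creation; maxshape=None on axis 0
    # marks that axis as resizable.
    dset = f.create_dataset('training.data',
                            shape=(0, 1, 5, 341),
                            maxshape=(None, 1, 5, 341),
                            dtype='f4')
    # resize() changes the extent of existing axes, never their number.
    dset.resize((100, 1, 5, 341))
    dset[:] = np.zeros((100, 1, 5, 341), dtype='f4')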

rocketman