
I am currently writing columns into an h5py dataset, having defined the dataset as:

import h5py
import numpy as np
f = h5py.File(batch_path,'w')
data = f.create_dataset('data_set',(525600,1300),dtype=np.float32)

and adding arrays to it as:

import pandas as pd
for index,file in enumerate(files):
    df = pd.read_csv(file)
    result = np.array(list(map(lambda x: float(x.split(';')[1]),df.as_matrix()[:,0])))
    data[:,index] = result[:]

However, the last step (data[:,index] = result[:]) takes an incredible amount of time. What is wrong here?
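
For reference, here is a minimal self-contained sketch of the same pattern, with random data standing in for the parsed CSV columns; the (525600,1300) float32 shape and the contiguous (non-chunked) dataset are taken from the question, while the file name, column count and timing are purely illustrative:

import time
import h5py
import numpy as np

n_rows, n_cols = 525600, 1300

with h5py.File('batch_demo.h5', 'w') as f:
    # Contiguous dataset, created exactly as in the question
    data = f.create_dataset('data_set', (n_rows, n_cols), dtype=np.float32)

    # Write a handful of columns and time each assignment
    for index in range(5):
        result = np.random.rand(n_rows).astype(np.float32)
        start = time.time()
        data[:, index] = result
        print('column {}: {:.2f} s'.format(index, time.time() - start))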

Erik
    I have seen inappropriate chunking massively affect read & write performance. If you know what kind of out-of-memory queries you will be performing on your HDF5 dataset, you can select an appropriate chunk size. See [h5py chunking docs](http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage) for details. – jpp Mar 09 '18 at 10:25
  • Any difference if you save to a (1300,525600) set, iterating on the 1st dimension? – hpaulj Mar 09 '18 at 15:45
  • 1) Use chunking. For example https://stackoverflow.com/a/48405220/4045774 2) Also take a look at "The simplest form of fancy slicing". This will also have a noticeable impact on performance. – max9111 Mar 20 '18 at 12:28
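
Following the chunking and layout suggestions in the comments above, here is a rough sketch of what a column-aligned chunk shape and the transposed alternative could look like; the chunk shape, file names and random data are illustrative assumptions, not taken from the question:

import h5py
import numpy as np

n_rows, n_cols = 525600, 1300

# Option 1: keep the (rows, columns) orientation, but align the chunk
# shape with the access pattern so each column assignment touches a
# single chunk instead of striding across the whole dataset.
with h5py.File('batch_chunked.h5', 'w') as f:
    data = f.create_dataset('data_set', (n_rows, n_cols),
                            dtype=np.float32, chunks=(n_rows, 1))
    for index in range(n_cols):
        result = np.random.rand(n_rows).astype(np.float32)
        data[:, index] = result

# Option 2: store the transpose, as hpaulj suggests, and write whole
# rows, which are contiguous in HDF5's C-order layout.
with h5py.File('batch_transposed.h5', 'w') as f:
    data = f.create_dataset('data_set', (n_cols, n_rows), dtype=np.float32)
    for index in range(n_cols):
        result = np.random.rand(n_rows).astype(np.float32)
        data[index, :] = result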

0 Answers