
I have an HDF5 training dataset of size (21760, 1, 33, 33), where 21760 is the total number of training samples. I want to train the network with mini-batches of size 128.

I want to ask:

How do I feed a 128-sample mini-batch from the whole dataset to TensorFlow at each step?

karl_TUM

3 Answers


If your data set is so large that it can't be imported into memory the way keveman suggested, you can use the h5py object directly:

import h5py
import tensorflow as tf

data = h5py.File('myfile.h5py', 'r')
data_size = data['data_set'].shape[0]
batch_size = 128
sess = tf.Session()
train_op = ...  # your training op, e.g. optimizer.minimize(loss)
input = ...     # tf.placeholder matching one batch of samples
for i in range(0, data_size, batch_size):
    current_data = data['data_set'][i:i + batch_size]  # read one batch from disk
    sess.run(train_op, feed_dict={input: current_data})

You can also run through a huge number of iterations and randomly select a batch each time:

import random
for i in range(iterations):
    pos = random.randint(0, int(data_size/batch_size)-1) * batch_size
    current_data = data['data_set'][pos:pos+batch_size]
    sess.run(train_op, feed_dict={input: current_data})

Or sequentially:

for i in range(iterations):
    pos = (i % int(data_size / batch_size)) * batch_size
    current_data = data['data_set'][pos:pos+batch_size]
    sess.run(train_op, feed_dict={input: current_data})

You probably want to write some more sophisticated code that goes through all the data in random order, but keeps track of which batches have been used so that no batch is used more often than the others. Once you have done a full pass through the training set, you enable all batches again and repeat; a sketch of that idea follows.
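For illustration, a minimal sketch of that bookkeeping, reusing the names from the snippets above (`num_epochs` is an assumed new variable): shuffle the batch start offsets once per epoch, so every batch is visited exactly once before any batch is repeated:

import random

num_epochs = 10  # assumed; use however many passes you need
batch_starts = list(range(0, data_size - batch_size + 1, batch_size))
for epoch in range(num_epochs):
    random.shuffle(batch_starts)  # a new random batch order each epoch
    for pos in batch_starts:      # every batch is used exactly once per epoch
        current_data = data['data_set'][pos:pos + batch_size]
        sess.run(train_op, feed_dict={input: current_data})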

alkanen
This approach seems logically right, but I have not gotten any positive results using it. My best guess is this: using code sample 1 above, in every iteration the network trains afresh, forgetting all that has been learned in the previous loop. So if we are fetching 30 samples or batches per iteration, at every loop/iteration only 30 data samples are being used, and at the next loop everything is overwritten. – rocksyne Jun 30 '18 at 14:18

You can read the HDF5 dataset into a numpy array and feed slices of the numpy array to the TensorFlow model. Pseudocode like the following would work:

import numpy, h5py
f = h5py.File('somefile.h5', 'r')
data = f.get('path/to/my/dataset')
data_as_array = numpy.array(data)  # load the whole dataset into memory
# sess, train_op and input are assumed to be defined as part of your model
for i in range(0, 21760, 128):
  sess.run(train_op, feed_dict={input: data_as_array[i:i+128, :, :, :]})
keveman
Thank you. But when the number of training iterations `i` is large, e.g. 100000, how do I feed it? – karl_TUM Jul 06 '16 at 14:45
If you only have `21760` training samples, you only have `21760/128` distinct mini-batches. You have to write an outer loop around the `i` loop and run many epochs over the training dataset (see the sketch after these comments). – keveman Jul 06 '16 at 14:47
One point confuses me: when the original data is shuffled before the mini-batches are extracted, does that mean the number of distinct mini-batches is more than `21760/128`? – karl_TUM Jul 06 '16 at 16:42
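To illustrate the outer epoch loop from keveman's comment, a minimal sketch reusing the names from the answer above (`num_epochs` is an assumed variable, not from the original):

num_epochs = 600  # assumed; 600 epochs x 170 batches per epoch gives ~102000 iterations
for epoch in range(num_epochs):
    for i in range(0, 21760, 128):  # one full pass over the training set
        sess.run(train_op, feed_dict={input: data_as_array[i:i+128, :, :, :]})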

alkamen's approach seems logically right, but I have not gotten any positive results using it. My best guess is this: using code sample 1 above, in every iteration the network trains afresh, forgetting all that has been learned in the previous loop. So if we are fetching 30 samples or batches per iteration, at every loop/iteration only 30 data samples are being used, and at the next loop everything is overwritten.

Find below a screenshot of this approach:

[Screenshot: training always starting afresh]

As can be seen, the loss and accuracy always start afresh. I would be happy if anyone could share a possible way around this.
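For what it's worth, one common cause of this symptom (an assumption on my part, not something confirmed in this thread) is re-running the variable initializer inside the batch loop, which resets all weights on every iteration. A minimal sketch of the intended structure, reusing the hypothetical names from code sample 1:

init = tf.global_variables_initializer()
sess.run(init)  # run ONCE before the loop so the weights persist across batches
for i in range(0, data_size, batch_size):
    # do not re-run init here; each sess.run(train_op) updates the same weights
    current_data = data['data_set'][i:i + batch_size]
    sess.run(train_op, feed_dict={input: current_data})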

rocksyne
You tagged in some other user; my name is spelled with an 'n', not an 'm' =) – alkanen Jun 30 '18 at 16:59
Your accuracy isn't reset; it does improve with each iteration and doesn't go back to zero. Are you sure that you get an entirely new batch every time you fetch one, and that the batches aren't highly overlapping? That would explain why your accuracy improves so much initially: you would basically be re-using the same training data in each iteration. And then when you reset the data and get new batches, you possibly randomise things again and get a new set of overlapping batches with data your net hasn't seen before. – alkanen Jun 30 '18 at 17:02
Thanks for your comment. Yes, I fetch new batches every time per my algorithm, and yes, the data is shuffled, but that is what I end up with, and (I may be wrong) I have a feeling that my previous answer is what is happening. I will keep looking around; if I do find anything, I will be glad to share. And... I am sorry I didn't get your name right. Thanks for your time. Cheers! – rocksyne Jul 02 '18 at 02:16
  • Okay. If it does reset and you're sure your batches don't overlap, it's probably not the data fetching that is wrong, but the model weight handling. I hope you find the problem, best of luck. – alkanen Jul 03 '18 at 08:05
  • Thanks a lot for your input. Very much appreciated. – rocksyne Jul 03 '18 at 14:03
@rocksyne, I am having a similar problem where the net is not learning after each batch. Did you manage to solve this? – CAta.RAy Feb 12 '19 at 17:06
@CAta.RAy Unfortunately I did not get any luck with TensorFlow, so I switched to using Keras. I built a custom generator for fetching the data in batches; I have created a GitHub Gist to help you understand (https://gist.github.com/rocksyne/a4022afd7a5aaacdfb873218dba21d0c). This generator is passed to Keras's fit_generator function (https://www.pyimagesearch.com/2018/12/24/how-to-use-keras-fit-and-fit_generator-a-hands-on-tutorial/). If you can share a bit more of what you are doing, I could understand your situation better and provide a more tailored answer. – rocksyne Feb 13 '19 at 05:39
@rocksyne Thanks for the reply. I have a similar situation at the moment: my net works just fine in Keras, but when I try to implement the same one in TensorFlow it does not work. I even tried to create the simplest net I could [(my question)](https://stackoverflow.com/q/51290691/2723405), but no luck. I am sure there is something wrong in my code but I can't figure it out. – CAta.RAy Feb 13 '19 at 16:15
@CAta.RAy I have taken a look at your code and, on the surface, everything looks fine. I say on the surface because I do not have your data set to try the code out. I will take a look at the code from this current thread and see if I can find a way around it with TensorFlow. If I do, I will be sure to let you know. – rocksyne Feb 14 '19 at 05:11
@rocksyne Thank you, I really appreciate any help. It is just very weird that it is not learning at all; the weights are never updated... – CAta.RAy Feb 14 '19 at 20:44