
I have specific questions about how to train a neural network when the training data is larger than RAM. I want to use the de facto standard, which appears to be Keras and TensorFlow.

  1. What are the key classes and methods I need to use, from NumPy, SciPy, pandas, h5py, and Keras, in order not to exceed my meager 8 GB of RAM? I have time to train the model; I don't have cash. My dataset would require 200 GB of RAM if loaded all at once.

  2. In Keras there is a model.fit() method. It expects X and Y as NumPy arrays. How do I get it to accept HDF5 arrays that live on disk? And when specifying the model architecture itself, how do I save RAM? Wouldn't the working memory requirement exceed 8 GB at times?

  3. Regarding fit_generator(), does that accept HDF5 files? If model.fit() can accept HDF5, do I even need fit_generator()? It seems that you still need to be able to fit the entire model in RAM even with these methods.

  4. In Keras, does the model include the training data when calculating its memory requirements? If so, I think I am in trouble.

In essence, I am under the assumption that at no time can I exceed my 8 GB of RAM, whether from one-hot encoding, loading the model, or training on even a small batch of samples. I am just not sure how to accomplish this concretely.

user798719

2 Answers


I cannot answer everything, and I'm also very interested in those answers, because I'm facing that 8 GB problem too.

I can only suggest how to pass the data in small batches, a little at a time.

Question 2:

I don't think Keras supports passing the h5py file directly (but I really don't know), but you can create a loop that loads the file partially (if the file was saved in a way that allows this).

You can create an outer loop (see the sketch after this list) to:

  • create a small array with only one or two samples from the file;
  • call the train_on_batch method, passing only that small array;
  • release the memory by discarding the array, or by overwriting the same array with the next sample(s).
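
A rough sketch of that loop, assuming you already have a compiled Keras model and that the data sits in a hypothetical HDF5 file "data.h5" with datasets named "X" and "Y" (adjust those names to whatever your file actually contains):

    import h5py

    batch_size = 2  # keep this tiny so only a couple of samples are in RAM

    with h5py.File("data.h5", "r") as f:  # the file itself stays on disk
        n_samples = f["X"].shape[0]
        for epoch in range(10):
            for start in range(0, n_samples, batch_size):
                end = start + batch_size
                # Slicing an h5py dataset reads only that slice from disk
                # into a small NumPy array.
                x_batch = f["X"][start:end]
                y_batch = f["Y"][start:end]
                model.train_on_batch(x_batch, y_batch)
                # The small arrays are overwritten on the next iteration,
                # so memory use stays bounded by the batch size.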

Question 3:

I also don't know about the h5py file: is the object that opens the file a Python generator?

If not, you can create the generator yourself.

The idea is to make the generator load only part of the file and yield small batch arrays with one or two data samples (pretty much the same as in question 2, but the loop goes inside a generator; see the sketch below).
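
A sketch of such a generator, under the same assumptions as above (a compiled Keras model and a hypothetical "data.h5" file with "X" and "Y" datasets); fit_generator then pulls one small batch at a time:

    import h5py

    def hdf5_batch_generator(path, batch_size=2):
        with h5py.File(path, "r") as f:
            n_samples = f["X"].shape[0]
            while True:  # Keras expects the generator to loop indefinitely
                for start in range(0, n_samples, batch_size):
                    end = start + batch_size
                    yield f["X"][start:end], f["Y"][start:end]

    batch_size = 2
    with h5py.File("data.h5", "r") as f:
        n_samples = f["X"].shape[0]

    model.fit_generator(hdf5_batch_generator("data.h5", batch_size),
                        steps_per_epoch=n_samples // batch_size,
                        epochs=10)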

Daniel Möller
  • I have since moved on to using pyspark. The algorithms are more rudimentary without all the neural network options, but I'm still at the stage where the algorithm matters less and having more data helps. I personally haven't seen cases where the algorithm matters most. – user798719 Jan 05 '18 at 07:36

Usually, for very large sample sets, an "online" training method is used. This means that instead of training your neural network in one go on a large batch, you update it incrementally as more samples arrive. See: Stochastic Gradient Descent
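
As an illustration only (not a prescribed pipeline), incremental updates in Keras can be driven by train_on_batch with a plain SGD optimizer. Here sample_stream() is a hypothetical function that yields one (x, y) pair at a time from wherever the data lives, and the model shape is made up:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import SGD

    # Tiny example model; the input shape is arbitrary for illustration.
    model = Sequential([Dense(32, activation="relu", input_shape=(100,)),
                        Dense(1, activation="sigmoid")])
    model.compile(optimizer=SGD(lr=0.01), loss="binary_crossentropy")

    for x, y in sample_stream():
        # Each call updates the weights from a single sample, so memory
        # usage is independent of the total dataset size.
        model.train_on_batch(np.expand_dims(x, 0), np.expand_dims(y, 0))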