
I want to perform hyperparameter optimization on my Keras model. The problem is that the dataset is quite big: in training I normally use fit_generator to load the data in batches from disk, but common packages like scikit-learn's GridSearchCV, Talos, etc. only support the fit method.

I tried to load the whole dataset into memory using this:

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator()  # augmentation options omitted here
# batch_size=train_nb puts the entire training set into a single batch
train_generator = train_datagen.flow_from_directory(
    original_dir,
    target_size=(img_height, img_width),
    batch_size=train_nb,
    class_mode='categorical')
X_train, y_train = train_generator.next()

But when performing the grid search, the OS kills the process because of excessive memory usage. I also tried undersampling my dataset to just 25%, but it was still too big.
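For scale: the images are small on disk, but flow_from_directory decodes them into float32 arrays, so the in-memory size depends only on the image count and target dimensions. A rough estimate for my ~9000 images (the 224x224 target size here is just an example):

# Rough estimate of the decoded in-memory size for 9000 RGB float32 images
images, height, width, channels = 9000, 224, 224, 3
bytes_per_float32 = 4
size_gb = images * height * width * channels * bytes_per_float32 / 1024 ** 3
print(round(size_gb, 1), 'GB')  # ~5.0 GB, before any copies made by the search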

Does anyone have experience with this scenario? Could you please share your strategy for performing hyperparameter optimization on a large dataset?

Following the answer from @dennis-ec, I tried the SkOpt tutorial here: http://slashtutorial.com/ai/tensorflow/19_hyper-parameters/ and it was a very comprehensive tutorial.

Thanh Nguyen
  • You can use fit_generator() with Talos. See info here: https://stackoverflow.com/questions/53559068/use-keras-imagedatagenerator-flow-from-directory-with-talos-scan – mikkokotila Jan 20 '19 at 15:56

2 Answers


In my opinion, grid search is not a good method for hyperparameter optimization, especially in deep learning, where you have many hyperparameters.

I would recommend Bayesian hyperparameter optimization. Here is a tutorial on how to implement it, using skopt. As you can see, you need to write a function that runs your training and returns the validation score to be optimized, so the API does not care whether you use fit or fit_generator from Keras.
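A minimal sketch of that idea with skopt's gp_minimize, assuming a hypothetical build_model() helper and that train_generator/val_generator (with counts train_nb/val_nb) are built with a normal batch_size rather than the whole-dataset batch from the question; the search space and epoch count are illustrative:

from skopt import gp_minimize
from skopt.space import Real, Integer
from skopt.utils import use_named_args

# Search space; the names and ranges here are illustrative
dimensions = [
    Real(1e-5, 1e-2, prior='log-uniform', name='learning_rate'),
    Integer(1, 3, name='num_dense_layers'),
]

@use_named_args(dimensions)
def fitness(learning_rate, num_dense_layers):
    model = build_model(learning_rate, num_dense_layers)  # hypothetical helper
    history = model.fit_generator(
        train_generator,
        steps_per_epoch=train_nb // batch_size,
        epochs=3,
        validation_data=val_generator,
        validation_steps=val_nb // batch_size)
    # gp_minimize minimizes, so negate the best validation accuracy
    return -max(history.history['val_acc'])

result = gp_minimize(func=fitness, dimensions=dimensions, n_calls=12)
print(result.x, -result.fun)  # best hyperparameters and their validation score

Because the objective function owns the whole training loop, the data never has to fit in memory at once; each call streams batches from disk exactly as your normal training does.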

dennis-w
  • 2,166
  • 1
  • 13
  • 23

See this question: how use grid search with fit generator in keras

The first answer seems to answer your question.

VegardKT
  • yes, I also looked into that answer and tried to modify it to use `flow_from_directory`, but it's quite complicated for me – Thanh Nguyen Aug 21 '18 at 07:11
  • Oh, yeah, I see your comment there now. My bad. I am honestly not sure how you can implement that with flow_from_directory, as I don't have much experience with it, but I can offer an alternative: undersample more aggressively until you are able to get it to run, use that subset to do a grid search, then verify those parameters with your generator (see the sketch after these comments). That is a plan B at least, if you're unable to get it to work otherwise. – VegardKT Aug 21 '18 at 07:29
  • Yeah, I tried with even a 10% sample, but it's still too large for memory. My dataset has 9 classes, so at 10% some classes are already close to having too few samples – Thanh Nguyen Aug 21 '18 at 07:40
  • How large is your dataset? (in filesize and number of samples) – VegardKT Aug 21 '18 at 08:40
  • I have 9000 images in total, each image is around 10-20 KB – Thanh Nguyen Aug 21 '18 at 09:35
  • Really? I find it strange that you are running out of memory then – VegardKT Aug 21 '18 at 10:01
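A minimal sketch of the plan B from the comments above, assuming a hypothetical build_model(learning_rate, dropout) function and the train_datagen from the question; the subset size, grid values, and epoch count are illustrative:

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

# Load one small, in-memory subset (here ~10% of the 9000 images);
# flow_from_directory shuffles by default, so the subset is random
subset_generator = train_datagen.flow_from_directory(
    original_dir,
    target_size=(img_height, img_width),
    batch_size=900,
    class_mode='categorical')
X_sub, y_sub = subset_generator.next()

# Grid search on the subset; param names must match build_model's arguments
model = KerasClassifier(build_fn=build_model, epochs=3, verbose=0)
param_grid = {'learning_rate': [1e-4, 1e-3], 'dropout': [0.25, 0.5]}
grid = GridSearchCV(model, param_grid, cv=3)
grid.fit(X_sub, y_sub)
print(grid.best_params_)  # re-train with fit_generator on the full data to verify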