
In Caffe, I am trying to implement a fully convolutional network for semantic segmentation. Is there a specific strategy for setting the values of the following hyper-parameters in `'solver.prototxt'`:

  • test_iter
  • test_interval
  • iter_size
  • max_iter

Does it depend on the number of images in your training set? If so, how?

Abhilash Panigrahi
  • Another meta-parameter is `weight_decay`. See [this thread](http://stackoverflow.com/q/32177764/1714410) on how to set it. – Shai Mar 16 '16 at 08:46

1 Answer


In order to set these values in a meaningful manner, you need to have a few more bits of information regarding your data:

1. **Training set size**: the total number of training examples you have; let's call this quantity `T`.
2. **Training batch size**: the number of training examples processed together in a single batch; this is usually set by the input data layer in the `'train_val.prototxt'`. For example, in this file the train batch size is set to 256. Let's denote this quantity by `tb`.
3. **Validation set size**: the total number of examples you set aside for validating your model; let's denote this by `V`.
4. **Validation batch size**: the value set in `batch_size` for the TEST phase. In this example it is set to 50. Let's call this `vb`.
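For reference, `tb` and `vb` are the `batch_size` fields of the TRAIN- and TEST-phase data layers in `'train_val.prototxt'`. A minimal sketch (layer names, sources, and values are illustrative):

```
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  data_param {
    source: "train_lmdb"   # illustrative path
    batch_size: 256        # this is tb
    backend: LMDB
  }
}
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TEST }
  data_param {
    source: "val_lmdb"     # illustrative path
    batch_size: 50         # this is vb
    backend: LMDB
  }
}
```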

Now, during training, you would like to get an unbiased estimate of the performance of your net every once in a while. To do so, you run your net on the validation set for `test_iter` iterations. To cover the entire validation set you need `test_iter = V/vb`.
How often would you like to get this estimation? It's really up to you. If you have a very large validation set and a slow net, validating too often will make the training process too long. On the other hand, not validating often enough may prevent you from noticing if and when your training process fails to converge. `test_interval` determines how often you validate: usually for large nets you set `test_interval` on the order of 5K; for smaller and faster nets you may choose lower values. Again, it's all up to you.

In order to cover the entire training set (completing an "epoch") you need to run `T/tb` iterations. Usually one trains for several epochs, thus `max_iter = #epochs * T/tb`.
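As a worked example of the formulas above (all dataset sizes below are made up for illustration):

```python
# Hypothetical dataset sizes, for illustration only.
T = 50000    # training examples
tb = 256     # train batch size (from train_val.prototxt)
V = 10000    # validation examples
vb = 50      # validation batch size (TEST phase)

epochs = 10  # how many passes over the training set you want

test_iter = V // vb          # cover the whole validation set each test
iters_per_epoch = T // tb    # iterations for one pass over the training set
max_iter = epochs * iters_per_epoch

print(test_iter)   # 200
print(max_iter)    # 1950
```

`test_interval` is then a free choice, e.g. one validation pass per epoch would mean `test_interval = iters_per_epoch`.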

Regarding `iter_size`: this lets you average gradients over several training mini-batches; see this thread for more information.
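In other words, Caffe accumulates gradients over `iter_size` forward/backward passes before each weight update, so the effective batch size is the product of the two (values below are illustrative):

```python
# Effective batch size when gradients are accumulated over iter_size
# mini-batches; useful when the desired batch does not fit in GPU memory.
batch_size = 32   # per-pass batch that fits in memory
iter_size = 8     # accumulate gradients over 8 passes per update

effective_batch = batch_size * iter_size
print(effective_batch)  # 256
```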

Shai
  • @Shai - Thank you for the detailed example. I am still confused by my use case of `Caffe` with `AlexNet`. I have a system with `115GB` of memory and I am using `train` and `val` data sets in LMDB from `ImageNet ILSVRC 2012`. I am using [this solver file](https://github.com/intel/caffe/blob/master/models/intel_optimized_models/alexnet/solver.prototxt) with all parameters as-is except `max_iteration=100`. I fail to understand why the memory consumption is approximately `10GB`. It should be much smaller, as `Caffe` operates on a batch of images instead of the full data set. Any idea how this calculation is done? – Chetan Arvind Patil Oct 22 '17 at 22:51
  • @ChetanArvindPatil it seems like you are confusing the storage requirements for the model's parameters with the RAM usage for train/val computation. Caffe stores in memory all **parameters** + their derivatives; additionally, it stores **data** (train/val batches) and its derivatives (for backprop). Some solvers even require additional storage for a per-parameter adjustable learning rate (e.g. `"Adam"`). All of these together can certainly require a lot of RAM. See e.g. [this thread](https://stackoverflow.com/q/36526959/1714410) – Shai Oct 23 '17 at 05:31