
I have a fairly simple ANN built with TensorFlow, using the AdamOptimizer for a regression problem, and I am now at the point of tuning all the hyperparameters.

So far, I have found many different hyperparameters that I have to tune:

  • Learning rate: initial learning rate, learning rate decay
  • The AdamOptimizer needs 4 arguments (learning_rate, beta1, beta2, epsilon), so we need to tune them - at least epsilon
  • Batch size
  • Number of iterations
  • Lambda, the L2-regularization parameter
  • Number of neurons, number of layers
  • Which activation function to use for the hidden layers and for the output layer
  • Dropout parameter

I have 2 questions :

1) Do you see any other hyperparameter I might have forgotten?

2) For now, my tuning is quite "manual", and I am not sure I am doing everything properly. Is there a special order in which to tune the parameters, e.g. learning rate first, then batch size, then ...? I am not sure that all these parameters are independent - in fact, I am quite sure that some of them are not. Which ones are clearly independent and which ones are clearly not? Should the dependent ones then be tuned together? Is there any paper or article that talks about properly tuning all the parameters in a particular order?

EDIT: Here are the graphs I got for different initial learning rates, batch sizes and regularization parameters. The purple curve is completely weird to me: its cost decreases much more slowly than the others, yet it gets stuck at a lower accuracy. Is it possible that the model is stuck in a local minimum?

[Accuracy graph]

[Cost graph]

For the learning rate, I used the decay LR(t) = LRI / sqrt(epoch), where LRI is the initial learning rate.
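As a sketch in code (with `initial_lr` standing for LRI, and epochs counted from 1 so the first epoch uses the full initial rate):

```python
import math

def decayed_lr(initial_lr, epoch):
    """Square-root learning rate decay: LR(t) = LRI / sqrt(epoch)."""
    return initial_lr / math.sqrt(epoch)

# e.g. with initial_lr = 0.001:
# epoch 1 -> 0.001, epoch 4 -> 0.0005, epoch 100 -> 0.0001
```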

Thanks for your help! Paul

Paul Rolin
  • Hi Paul, I wonder why you use `LRI/sqrt(epoch)` as the learning rate decay? I'm using `LRI/max(epoch_0, epoch)`, where I have set `epoch_0` to the epoch in which I want the decay to start, but maybe you get faster convergence if you take the square root of the denominator like you do. Do you have any reference for that learning rate decay, or did you come up with it more or less yourself? – HelloGoodbye Jun 27 '16 at 22:16
  • Hi @HelloGoodbye! In the article presenting the Adam optimizer (https://arxiv.org/pdf/1412.6980.pdf), they use a square-root decay for the learning rate to prove convergence in Theorem 4.1. – Paul Rolin Jun 28 '16 at 15:51

5 Answers


My general order is:

  1. Batch size, as it will largely affect the training time of future experiments.
  2. Architecture of the network:
    • Number of neurons in the network
    • Number of layers
  3. Rest (dropout, L2 reg, etc.)

Dependencies:

I'd assume that the optimal values of

  • learning rate and batch size
  • learning rate and number of neurons
  • number of neurons and number of layers

strongly depend on each other. I am not an expert in that field, though.

As for your hyperparameters:

  • For the Adam optimizer: "Recommended values in the paper are eps = 1e-8, beta1 = 0.9, beta2 = 0.999." (source)
  • For the learning rate with Adam and RMSProp, I found values around 0.001 to be optimal for most problems.
  • As an alternative to Adam, you can also use RMSProp, which reduces the memory footprint by up to 33%. See this answer for more details.
  • You could also tune the initial weight values (see All you need is a good init). That said, the Xavier initializer seems to be a good way to avoid having to tune the weight initialization.
  • I don't tune the number of iterations / epochs as a hyperparameter. I train the net until its validation error converges. However, I give each run a time budget.
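For intuition, here is a toy, 1-D sketch of the Adam update with those recommended defaults (variable names are mine), minimizing f(x) = x² just to show where beta1, beta2 and epsilon enter:

```python
def adam_minimize(grad, x0, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=5000):
    """Minimize a 1-D function given its gradient, using the Adam update."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # 1st-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g * g    # 2nd-moment (uncentered) estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction for the
        v_hat = v / (1 - beta2 ** t)           # zero initialization of m and v
        x -= lr * m_hat / (v_hat ** 0.5 + eps)
    return x

# minimizing f(x) = x^2 (gradient 2x) should drive x toward 0
x_min = adam_minimize(lambda x: 2 * x, x0=1.0)
```

Note how epsilon only keeps the division stable, while beta1 and beta2 set how much history the two moment estimates carry.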
Kilian Obermeier
  • RMSprop is almost equivalent to AdamOptimizer with `beta1=0`. (AdamOptimizer corrects for bias in the RMS term initially, but the bias and correction both approach zero after enough training steps.) – Charles Staats Jan 19 '17 at 01:51
  • @CharlesStaats Thanks for the input! Searching for the difference between Adam and RMSprop, I found this: http://cs231n.github.io/neural-networks-3/#ada "Notice that the update looks exactly as RMSProp update, except the “smooth” version of the gradient m is used instead of the raw (and perhaps noisy) gradient vector dx. Recommended values in the paper are eps = 1e-8, beta1 = 0.9, beta2 = 0.999. In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp." So you're right. I'll use Adam from now on. – Kilian Obermeier Jan 19 '17 at 08:09

Get TensorBoard running and plot the error there. You'll need to create subdirectories in the path where TensorBoard looks for the data to plot; I do that subdirectory creation in the script. So I change a parameter in the script, give the trial a name there, run it, and plot all the trials in the same chart. You'll very soon get a feel for the most effective settings for your graph and data.
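A sketch of how such per-trial subdirectories might be named (the summary-writing calls themselves depend on your TensorFlow version, so they are omitted; the naming scheme is my own):

```python
import os

def trial_logdir(base_dir, lr, batch_size):
    """Build a per-trial subdirectory name so TensorBoard shows
    all trials in the same chart, one curve per subdirectory."""
    name = "lr_{:g}_bs_{}".format(lr, batch_size)
    return os.path.join(base_dir, name)

# one subdirectory per hyperparameter setting, e.g.:
# logs/lr_0.001_bs_64, logs/lr_0.01_bs_128, ...
```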

Phillip Bock
  • Thanks! I didn't know we could display different graphs in the same TensorBoard window. Will look into that, even if it doesn't quite answer my initial questions. – Paul Rolin May 27 '16 at 14:23

For parameters that are less important you can probably just pick a reasonable value and stick with it.

Like you said, the optimal values of these parameters all depend on each other. The easiest thing to do is to define a reasonable range of values for each hyperparameter. Then randomly sample a parameter from each range and train a model with that setting. Repeat this a bunch of times and then pick the best model. If you are lucky you will be able to analyze which hyperparameter settings worked best and make some conclusions from that.

Aaron
  • Thanks for your answer! So it is better to randomly select values for ALL the hyperparameters and tune them together than to tune each one separately? – Paul Rolin May 26 '16 at 21:01
  • Yes. Unless there is some hyperparameter that you know doesn't matter that much. For those ones you can just pick a value and optimize the rest. For example, people might arbitrarily decide to use two layers and sigmoid activations but then optimize over the size of each layer. – Aaron May 27 '16 at 17:03
  • Thanks! I've added the graphs I got from randomized choices of values for the initial learning rate, regularization parameter and batch size. I can't figure out why the model gets stuck at such a low accuracy... – Paul Rolin May 27 '16 at 17:26

I don't know of any tool specific to TensorFlow, but the best strategy is to start with basic hyperparameter values such as a learning rate of 0.01 or 0.001 and a weight_decay of 0.005 or 0.0005, and then tune from there. Doing it manually will take a lot of time. If you are using Caffe, the following is the best option: it will take the hyperparameters from a set of input values and give you the best set.

https://github.com/kuz/caffe-with-spearmint

For more information, you can follow this tutorial as well:

http://fastml.com/optimizing-hyperparams-with-hyperopt/

For the number of layers, what I suggest is to first make a smaller network and increase the data; once you have sufficient data, increase the model complexity.

Dharma

Before you begin:

  • Set the batch size to the maximum (or maximal power of 2) that works on your hardware. Simply increase it until you get a CUDA out-of-memory error (or system RAM usage > 90%).
  • Set regularizers to low values.
  • For the architecture and the exact numbers of neurons and layers, use known architectures as inspiration and adjust them to your specific performance requirements: more layers and neurons -> a possibly stronger, but slower, model.

Then, if you want to do it one by one, I would go like this:

  1. Tune learning rate in a wide range.

  2. Tune other parameters of the optimizer.

  3. Tune regularizers (dropout, L2, etc.).

  4. Fine-tune the learning rate - it's the most important hyperparameter.
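A sketch of that one-by-one procedure (the candidate ranges and the `evaluate` function are placeholders for your own training runs):

```python
def tune_sequentially(evaluate, stages, params):
    """Tune one hyperparameter (stage) at a time, keeping the best
    value found before moving on to the next stage."""
    for name, candidates in stages:
        best_value, best_score = params[name], float("-inf")
        for value in candidates:
            trial = dict(params, **{name: value})
            score = evaluate(trial)  # e.g. validation accuracy
            if score > best_score:
                best_value, best_score = value, score
        params[name] = best_value
    return params

# stage order from above: wide LR sweep, optimizer params,
# regularizers, then a fine LR sweep around the current best
stages = [
    ("learning_rate", [1e-4, 1e-3, 1e-2, 1e-1]),
    ("beta1", [0.85, 0.9, 0.95]),
    ("l2_lambda", [0.0, 1e-4, 1e-3]),
    ("learning_rate", [5e-4, 1e-3, 2e-3]),
]
```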

hans