I would like to be more specific about some of Juan's statements, but my reputation is not enough to comment, so I am posting this as an answer instead.
You should not be afraid of local minima. In practice, as far as I understand, we can classify them as 'good local minima' and 'bad local minima'. The reason we want a higher learning rate, as Juan said, is that we want to find a better 'good local minimum'. If you set your initial learning rate too high, that is bad, because your model will likely fall into a 'bad local minimum' region. And once that happens, the 'decaying learning rate' practice cannot help you.
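To make the 'decaying learning rate' practice concrete, here is a minimal sketch in PyTorch (the framework, the placeholder model, and the specific schedule of cutting the rate by 10x every 30 epochs are my assumptions, chosen only for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # moderately high start, not "too high"
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()        # dummy loss in place of real training
    loss.backward()
    optimizer.step()
    scheduler.step()                                       # decay the learning rate on schedule
```

The point of the sketch: decay only shrinks the step size over time; it cannot rescue you if the first large steps have already thrown the weights into a bad region.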
Then, how can we ensure that the weights will fall into a good region? The answer is that we can't, but we can increase the chance by choosing a good set of initial weights. Once again, a too-large initial learning rate will make your careful initialization meaningless.
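As an example of choosing initial weights deliberately, a rough sketch could look like this (again PyTorch; the He/Kaiming scheme and the toy architecture are assumptions for illustration, not a recommendation for your specific model):

```python
import torch.nn as nn

def init_weights(m):
    # Apply He/Kaiming initialization to every Linear layer, zero the biases.
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
net.apply(init_weights)   # every layer gets the chosen initialization
```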
Secondly, it is always good to understand your optimizer. Take some time to look at its implementation and you will find something interesting. For example, with adaptive optimizers such as Adam, the value you configure as the 'learning rate' is not the actual per-parameter step size.
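A rough sketch of a single Adam-style update illustrates this (this is not the library's code, and the gradient value is an assumption): the step each parameter takes is the configured lr rescaled by running moment estimates, so it shrinks or grows with the gradient history.

```python
import math

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
m = v = 0.0
grad, t = 0.5, 1                        # an assumed gradient at step t = 1

m = beta1 * m + (1 - beta1) * grad      # first-moment (mean) estimate
v = beta2 * v + (1 - beta2) * grad**2   # second-moment estimate
m_hat = m / (1 - beta1**t)              # bias correction
v_hat = v / (1 - beta2**t)

step = lr * m_hat / (math.sqrt(v_hat) + eps)
print(step)   # close to 1e-3 at this step, but not equal to lr in general
```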
In sum:
1/ Needless to say, a small learning rate is not good, but a too-big learning rate is definitely bad.
2/ Weight initialization is your first guess; it DOES affect your result.
3/ Taking time to understand your code is a good practice.