
I was going through a basic PyTorch MNIST example here and noticed that when I changed the optimizer from SGD to Adam the model did not converge. Specifically, I changed line 106 from

optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

to

optimizer = optim.Adam(model.parameters(), lr=args.lr)

I thought this change would have no effect on the model. With SGD the loss quickly dropped to low values after about a quarter of an epoch. However, with Adam the loss did not drop at all, even after 10 epochs. I'm curious as to why this is happening; it seems to me these should have nearly identical performance.

I ran this on Win10/Py3.6/PyTorch1.01/CUDA9

And to save you a tiny bit of code digging, here are the hyperparams:

  • lr=0.01
  • momentum=0.5
  • batch_size=64

1 Answer

Adam is famous for working out of the box with its default parameters, which, in almost all frameworks, include a learning rate of 0.001 (see the default values in Keras, PyTorch, and TensorFlow), and this is indeed the value suggested in the Adam paper.

So, I would suggest changing to

optimizer = optim.Adam(model.parameters(), lr=0.001)

or simply

optimizer = optim.Adam(model.parameters())

in order to leave lr at its default value (although I would say I am surprised, as MNIST is famous nowadays for working with practically whatever you may throw at it).
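To see why the learning rate matters so much more for Adam than intuition might suggest, note that Adam normalizes each update by the running second-moment estimate, so the size of its first step is roughly lr regardless of the gradient's magnitude. Here is a minimal plain-Python sketch of that first update (an illustration of the update rule, not PyTorch's internal implementation):

```python
import math

def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's very first update for a single scalar gradient."""
    m = (1 - beta1) * grad          # first-moment estimate after step 1
    v = (1 - beta2) * grad ** 2     # second-moment estimate after step 1
    m_hat = m / (1 - beta1)         # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# Whatever the gradient scale, the first step is essentially lr * sign(grad):
for g in (0.001, 1.0, 1000.0):
    print(f"grad={g:>8}: step={adam_first_step(g, lr=0.01):.6f}")
```

So with lr=0.01 every parameter moves by roughly 0.01 per step, 10x the well-tested default of 0.001, which can easily be large enough to keep the loss from decreasing. SGD, by contrast, scales its step with the raw gradient, so the same lr=0.01 behaves very differently there.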

desertnaut