I was going through a basic PyTorch MNIST example here and noticed that when I changed the optimizer from SGD to Adam, the model stopped converging. Specifically, I changed line 106 from
optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
to
optimizer = optim.Adam(model.parameters(), lr=args.lr)
I thought this would have little effect on the model. With SGD the loss quickly dropped to low values after about a quarter of an epoch, but with Adam the loss did not drop at all, even after 10 epochs. I'm curious as to why this is happening; it seems to me the two optimizers should perform nearly identically here.
I ran this on Windows 10 / Python 3.6 / PyTorch 1.0.1 / CUDA 9.
And to save you a tiny bit of code digging, here are the hyperparams:
- lr=0.01
- momentum=0.5
- batch_size=64
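For context, here is a minimal, self-contained sketch of what I'm running. The `Net` below is a simplified stand-in (a two-layer MLP) for the example's actual CNN, and the data path and training loop are reconstructed from memory, so treat those details as assumptions rather than the repo's exact code; the two optimizer lines are the only part I changed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

# Simplified stand-in for the example's Net class (assumption, not the repo's exact model)
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)          # flatten 28x28 images
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Net().to(device)

# Same normalization constants and batch size as the example
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST("../data", train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,)),
                   ])),
    batch_size=64, shuffle=True)

# Original line 106: loss drops quickly, within about a quarter of an epoch
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

# My change: same lr carried over, but the loss never drops
# optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(1, 11):                 # 10 epochs
    for data, target in train_loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = F.nll_loss(model(data), target)
        loss.backward()
        optimizer.step()
```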