There are several optimizers for training neural networks, but plain SGD and SGD with momentum often seem to generalize better than adaptive methods such as Adam.
I am now writing a TensorFlow program to reproduce someone else's results; they trained with momentum in pylearn2. Their configuration has several hyperparameters: a momentum factor, a weight scale, and a bias scale, and they assign the weight scale to the weights of the dropout layers.
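To make the setup concrete, here is a minimal sketch of how I currently translate those hyperparameters into TensorFlow. Treating the weight scale and bias scale as standard deviations of the initial random weights and biases is my own interpretation of the pylearn2 config, not something stated by the authors, and all the numeric values are placeholders:

```python
import tensorflow as tf

weight_scale = 0.01  # placeholder; my reading of their "weight scale"
bias_scale = 0.0     # placeholder; my reading of their "bias scale"

# A dense layer followed by dropout, with the scales used as initializer
# standard deviations (my assumption about what the scales mean).
layer = tf.keras.layers.Dense(
    256,
    activation="relu",
    kernel_initializer=tf.keras.initializers.RandomNormal(stddev=weight_scale),
    bias_initializer=tf.keras.initializers.RandomNormal(stddev=bias_scale),
)
dropout = tf.keras.layers.Dropout(0.5)  # dropout rate is also a placeholder
```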
When I train my network with the Momentum optimizer, the network is very hard to train and the loss stays high. When I train with Adam the result is reasonable, but it is still worse than theirs by about 0.00X.
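For reference, this is roughly how I construct the two optimizers; the learning rates and momentum factor are values I have been experimenting with, not the authors' settings:

```python
import tensorflow as tf

# SGD with momentum: the optimizer I am trying to make work.
momentum_opt = tf.keras.optimizers.SGD(
    learning_rate=0.01,  # placeholder; probably needs tuning or a schedule
    momentum=0.9,        # the "momentum factor" from their config
)

# Adam baseline that currently gets closer to their numbers.
adam_opt = tf.keras.optimizers.Adam(learning_rate=1e-3)
```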
I want to know how to tune the Momentum optimizer, and also why my program does not work well.