
Referring to this answer on choosing the number of hidden layers and units in a NN: https://stackoverflow.com/a/10568938/2265724
The post suggests adding hidden units until the generalization error starts to increase.
But my problem is the learning rate. Given a particular number of hidden units (i.e. one data point in the graph, one particular architecture, e.g. 10 hidden units), how do I set the learning rate, and for how many epochs do I train? The options I can think of:
1. use a fixed learning rate (after checking that it converges, i.e. the cost drops) and run for n epochs, or until the cost (or validation error) plateaus (if it does drop in a nice asymptotic fashion)
2. as in 1, with early stopping (a rough sketch of options 1-2 follows this list)
3. as in 1 or 2, but trying several different learning rates over a certain (linear or log) range
4. as in 3, including learning rate decay
5. as in 3 or 4, including weight decay as regularization, or perhaps better, dropout
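
A minimal sketch of what I mean by options 1-2 (Keras is assumed here just for illustration; the random data and the 10-unit architecture are placeholders):

```python
# Hypothetical sketch of options 1-2: fixed learning rate, early stopping.
# Keras and the random data below are placeholders, not part of the question.
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 20).astype("float32")   # placeholder inputs
y = np.random.randint(0, 2, size=1000)           # placeholder binary labels

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(10, activation="relu"),    # e.g. 10 hidden units
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # fixed rate
              loss="binary_crossentropy")

# Option 2: stop when the validation loss plateaus instead of fixing n epochs.
stopper = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                        restore_best_weights=True)
model.fit(X, y, epochs=200, validation_split=0.2, callbacks=[stopper], verbose=0)
```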

The number of hyperparameters to tune increases from option 1 to option 5. Option 1 is quickest but doesn't sound satisfactory (why not try other learning rates?). Options 3-5 are time consuming, because if I am not happy with the result I need to try another architecture by increasing the number of hidden units, and repeat until the graph shown in the post is obtained.

Am I understanding and practicing this correctly?

ng0323

2 Answers


This is a hard problem; there's even a sub-field of machine learning dedicated to exploring this, called hyperparameter optimization.

The most basic method for solving the hyperparameter problem is brute-force search, in which you systematically vary the hyperparameter settings along a grid ("grid search") and pick the best one. This is pretty slow, and it's also annoying because it seems like there ought to be a better way.
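
For concreteness, here is a minimal grid-search sketch (not part of the original answer; scikit-learn, the toy dataset and the parameter ranges are all assumptions):

```python
# Hypothetical grid search over learning rate and hidden-layer size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "hidden_layer_sizes": [(5,), (10,), (20,)],   # architectures to compare
    "learning_rate_init": [1e-3, 1e-2, 1e-1],     # log-spaced learning rates
}

search = GridSearchCV(
    MLPClassifier(max_iter=500, early_stopping=True, random_state=0),
    param_grid,
    cv=3,          # cross-validation estimates the generalization error
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```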

There are a few different schools of thought on improving grid search:

  • Evolutionary methods assign some fitness score to a combination of hyperparameters and then attempt to re-use combinations of parameter settings that have performed well together. The most popular method I've seen recently in this camp is CMA-ES.

  • Bayesian methods attempt to place some sort of prior distribution over the values that the researcher thinks are reasonable for each hyperparameter. Then by evaluating several different hyperparameter settings, you can combine the resulting performance with the prior in a statistically optimal way.
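
As one concrete example of this family (a sketch only; hyperopt's TPE optimizer and the dummy objective below are assumptions, not part of the original answer), the search might look like:

```python
# Hypothetical sequential model-based (Bayesian-style) search with hyperopt.
import numpy as np
from hyperopt import fmin, tpe, hp, Trials

def objective(params):
    # Placeholder: train a network with these hyperparameters and return its
    # validation error. The formula below is a dummy stand-in for that run.
    lr, n_hidden = params["lr"], int(params["n_hidden"])
    return (np.log10(lr) + 2.0) ** 2 + 0.01 * n_hidden

space = {
    "lr": hp.loguniform("lr", np.log(1e-4), np.log(1e-1)),  # prior over learning rate
    "n_hidden": hp.quniform("n_hidden", 5, 50, 5),          # prior over hidden units
}

best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=Trials())
print(best)
```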

lmjohns3

The learning rate needed to reach a lower generalization error can be problem dependent. From previous experience, the optimum learning rate can differ depending on a number of factors, including epoch size, number of learning iterations, number of hidden layers and/or neurons, and the number and format of the inputs. Trial and error was often used to determine the ideal learning conditions for each problem studied.

There are papers from the past that provide a reasonable starting point for neural net parameters given the amount of training data and the numbers of hidden layers, neurons and outputs. These may be a good place to begin.

Perhaps other dynamic models are available that help push the optimization out of local minima and reduce the generalization error. Each problem has its own ideal parameters, and finding them requires either tinkering by hand or some form of dynamic or automated model.
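
To illustrate what that tinkering can look like (a hypothetical sketch only; scikit-learn, the toy data and the log-spaced range are assumptions), a simple sweep over learning rates could be:

```python
# Hypothetical sketch: sweep the learning rate over a log-spaced range
# (the asker's option 3) and keep the setting with the lowest validation error.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for lr in np.logspace(-3, 0, 10):             # 10 points in [0.001, 1]
    net = MLPClassifier(hidden_layer_sizes=(10,), learning_rate_init=lr,
                        max_iter=500, early_stopping=True, random_state=0)
    net.fit(X_tr, y_tr)
    results[lr] = 1.0 - net.score(X_val, y_val)   # validation error

best_lr = min(results, key=results.get)
print(best_lr, results[best_lr])
```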

Matthew Spencer
  • So you're saying the learning rate should be tinkered with, i.e. options 1 or 2 above are not satisfactory. Then how much tinkering, i.e. how many points n should I try, say in [0.001, 1]? In my problem, n = 10 would take a few days. I have seen papers saying "... we trained our neural net with learning rate = 0.01 ...", but it's not clear how much tinkering they did. – ng0323 Aug 29 '14 at 10:20
  • In my own research and publications, I would generally tinker with the neural network parameters and then report the optimum conditions in the paper. These parameters were evaluated and discussed in my dissertation but kept short in publications. That's not to say more dynamic models aren't available now, but past experience has shown correlations between generalization error and not only the learning rate but also other parameters of the neural network. I generally applied process number 3 (linear), which can take time depending on the number of tests. – Matthew Spencer Sep 01 '14 at 00:05