
I was wondering if someone could provide a good source on how to approach choosing the solver's hyper-parameters based on the complexity of my problem.

Basically, I understand that many feel they are "shooting in the dark" when setting and then modifying these parameters, and a system or benchmark for choosing parameters based on the specific complexity of the problem or data has escaped me.

If you care to explain your own methodology or simply provide commentary on your source, it would be much appreciated.

Aidan Gomez
    One of these "hyper-parameters" is `'weight_decay'`. You can find a thread discussing its role and some "rule of thumb" to setting its value [here](http://stackoverflow.com/q/32177764/1714410). – Shai Oct 11 '15 at 07:09
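    For context, a minimal sketch of how `weight_decay` enters a plain SGD update, where it acts as L2 regularization; the learning rate and decay values below are placeholders, not recommendations:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01, weight_decay=0.0005):
    # The decay term shrinks each weight toward zero in proportion
    # to its magnitude, on top of the loss gradient.
    return w - lr * (grad + weight_decay * w)

# Toy usage: one update on a small weight vector.
w = np.array([0.5, -1.2, 3.0])
grad = np.array([0.1, -0.2, 0.05])
w = sgd_step(w, grad)
```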

2 Answers


Since the hyperparameters we're talking about are related to backpropagation, which is a gradient-based approach, I believe the main reference is Y. Bengio's *Practical Recommendations for Gradient-Based Training of Deep Architectures*, along with the more classic *Efficient BackProp* by LeCun et al.

There are three main approaches to finding the optimal value for a hyperparameter. The first two are well explained in the first paper I linked.

  • Manual search. The researcher chooses the optimal value through trial and error.
  • Automatic search. The researcher relies on an automated routine, such as grid search or random search, to speed up the search (a minimal sketch follows this list).
  • Bayesian Optimization. You can find a video presenting it here.
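
As a concrete illustration of the second approach, here is a minimal random-search sketch over two common solver hyperparameters. The search ranges and the `train_and_evaluate` surrogate are placeholders standing in for your own training and validation code:

```python
import math
import random

def sample_config():
    # Sample log-uniformly, since these hyperparameters act across
    # orders of magnitude; the ranges here are illustrative.
    return {
        "learning_rate": 10 ** random.uniform(-5, -1),
        "weight_decay": 10 ** random.uniform(-6, -2),
    }

def train_and_evaluate(config):
    # Stand-in for a real training run; returns a validation score.
    # This surrogate peaks near lr=1e-3 and weight_decay=1e-4 so the
    # sketch runs end to end. Replace it with code that trains a model
    # using `config` and returns its validation accuracy.
    lr_term = -(math.log10(config["learning_rate"]) + 3) ** 2
    wd_term = -(math.log10(config["weight_decay"]) + 4) ** 2
    return lr_term + wd_term

def random_search(n_trials=20):
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = sample_config()
        score = train_and_evaluate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

best_config, best_score = random_search(n_trials=50)
print(best_config, best_score)
```

Grid search is the same loop with the sampler replaced by an exhaustive sweep, and Bayesian optimization replaces it with a model that proposes promising configurations based on past trials.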
Flavio Ferrara
  • The video was great! Great theory. I'll try and keep up to date with Bengio; so glad that the theory of machine learning is being researched and codified. – Aidan Gomez Oct 07 '15 at 18:53

I think this is the main reference:

http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

Also take a look at Chapter 5 in: http://neuralnetworksanddeeplearning.com/

cgarner
  • The paper from Krizhevsky et al. is just an example of choosing (good) hyperparameters. It doesn't provide a methodology or a theoretical foundation for their choice, e.g., batch size, learning rate or weight decay. – Flavio Ferrara Oct 06 '15 at 13:57
  • If there were a theoretical foundation it would be much easier! As far as I can tell it is all trial and error or computer-aided trial and error. – cgarner Oct 07 '15 at 08:14
  • Indeed. Keep an eye on Yoshua Bengio's work; his lab is working hard on deep learning theory. – Flavio Ferrara Oct 07 '15 at 16:54