
Consider the training process of a deep feedforward (FF) neural network using mini-batch gradient descent. As far as I understand, at each epoch of training we get a different random split of the data into mini-batches. Iterating over all mini-batches and computing the gradients of the NN parameters, we therefore get random gradients at each iteration and, consequently, random directions in which the model parameters move to minimize the cost function. Let's imagine we fixed the hyperparameters of the training algorithm and started the training process again and again; then we would end up with models that differ completely from each other, because in those trainings the changes of the model parameters were different.
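To make this concrete, here is a minimal PyTorch sketch of what I mean (the toy data, architecture, seeds and number of epochs are arbitrary, just to show the effect):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train_once(train_seed):
    torch.manual_seed(0)                        # same toy data for every run
    X = torch.randn(256, 10)
    y = (X.sum(dim=1, keepdim=True) > 0).float()
    torch.manual_seed(train_seed)               # controls weight init and batch shuffling
    loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)
    model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(5):                          # a few epochs
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    return model[0].weight.detach().clone()     # weights of the first layer

# Different training seeds -> different shuffles and inits -> different final weights
print(torch.allclose(train_once(0), train_once(1)))   # usually False
print(torch.allclose(train_once(0), train_once(0)))   # True: same seed reproduces the run
```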

1) Is this always the case when we use such randomness-based training algorithms?

2) If so, what is the guarantee that training the NN one more time with the best hyperparameters found during the previous training and validation runs will yield the best model again?

3) Is it possible to find hyperparameters that will always yield the best models?

  • This isn't a bad question, but it's fairly theoretical, and I'm not sure it's the best fit for SO, as it doesn't seem within the scope of programming (e.g. how to do something in Tensorflow or Pytorch) – paisanco Jan 12 '19 at 15:00
  • The rationale of the answer here might be helpful: [Machine learning algorithm score changes without any change in data or step](https://stackoverflow.com/questions/53922960/machine-learning-algorithm-score-changes-without-any-change-in-data-or-step) – desertnaut Jan 13 '19 at 20:07

1 Answer


A neural network is solving an optimization problem. As long as it computes a gradient that points in roughly the right direction, the fact that the direction is random (because of mini-batch sampling) does not hurt its objective of generalizing over the data. It can get stuck in some local optimum, but there are many good optimization methods, such as Adam, RMSProp and momentum-based SGD, that help it accomplish its objective.
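For example, in PyTorch any of these optimizers can drive the same mini-batch training loop (the model architecture and learning rates below are just placeholders):

```python
import torch
from torch import nn

# A small feed-forward model; the architecture is arbitrary, just for illustration
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

# Any of these can be used in the same training loop; the learning rates are placeholders
opt_sgd_momentum = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
opt_rmsprop      = torch.optim.RMSprop(model.parameters(), lr=0.001)
opt_adam         = torch.optim.Adam(model.parameters(), lr=0.001)
```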

Another reason: when you use mini-batches, each batch still contains enough samples for the model to learn from. The error rate may fluctuate from batch to batch, but the procedure still converges to at least a local solution.

Moreover, at each random sampling the mini-batches contain different samples, which helps the model generalize well over the complete data distribution. The sketch below illustrates this batch-to-batch fluctuation.
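As a rough illustration (the toy data, architecture and learning rate are made up), the per-batch loss jumps around while the per-epoch average still trends downward:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(512, 10)
y = (X.sum(dim=1, keepdim=True) > 0).float()
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # reshuffled every epoch

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(3):
    batch_losses = []
    for xb, yb in loader:
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
        batch_losses.append(loss.item())   # noisy from batch to batch
    # Per-batch losses fluctuate, but the epoch average tends to decrease
    print(f"epoch {epoch}: min={min(batch_losses):.3f} "
          f"max={max(batch_losses):.3f} mean={sum(batch_losses)/len(batch_losses):.3f}")
```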

For hyperparameter selection, you need to tune the hyperparameters and validate the results on unseen data; there is no straightforward method to choose them.
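A common approach is a simple grid or random search scored on a held-out validation set; here is a minimal sketch (the data, model and candidate learning rates are made up):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(600, 10)
y = (X.sum(dim=1, keepdim=True) > 0).float()
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]            # held-out data for model selection

def train_and_validate(lr, seed=0):
    torch.manual_seed(seed)                # fix the seed so the configs are comparable
    loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)
    model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(5):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    with torch.no_grad():
        return loss_fn(model(X_val), y_val).item()

# Simple grid search over the learning rate; keep the config with the best validation loss
results = {lr: train_and_validate(lr) for lr in [1e-3, 1e-2, 1e-1]}
best_lr = min(results, key=results.get)
print(results, "best lr:", best_lr)
```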

Ankish Bansal