
I am using the TensorFlow DNNRegressor Estimator model to build a neural network. But calling the estimator.train() function gives output like this:

[screenshot: per-step loss output printed by estimator.train()]

That is, my loss value varies a lot from step to step. But as far as I know, the loss should decrease with the number of iterations. Also attached is a screenshot of the TensorBoard visualisation of the loss:

[screenshot: TensorBoard chart of the loss]

The doubts I'm not able to figure out are:

  • Is it the overall loss value (the combined loss for every step processed so far) or just that step's loss value?
  • If it is that step's loss value, then how do I get the value of the overall loss and see its trend, which I feel should decrease with an increasing number of iterations? To my knowledge, that is the value we should look at while training on a dataset.
  • If it is the overall loss value, then why is it fluctuating so much? Am I missing something?

2 Answers


First of all, let me point out that tf.contrib.learn.DNNRegressor uses a linear regression head with mean_squared_loss, i.e. simple L2 loss.
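For reference, here is a minimal sketch of what that loss computes, using the TF 1.x losses API (the numbers are made up purely for illustration):

    import tensorflow as tf

    # Made-up labels and predictions for a batch of three examples.
    labels = tf.constant([1.0, 2.0, 3.0])
    predictions = tf.constant([1.5, 1.5, 2.5])

    # mean_squared_error averages (labels - predictions)^2 over the batch;
    # this is the per-step value that shows up on the loss chart.
    loss = tf.losses.mean_squared_error(labels, predictions)

    with tf.Session() as sess:
        print(sess.run(loss))  # (0.25 + 0.25 + 0.25) / 3 = 0.25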

Is it the overall loss value (the combined loss for every step processed so far) or just that step's loss value?

Each point on the chart is the loss value for a single step, computed on that step's batch with the weights learned so far.

If it is that step's loss value, then how do I get the value of the overall loss and see its trend, which I feel should decrease with an increasing number of iterations?

There's no overall loss function; you probably mean a chart of how the loss changed after each step, and that's exactly what TensorBoard is showing you. You are right that its trend should be downwards, and it isn't, which indicates that your neural network is not learning.
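If you want to see the trend through the noise, you can smooth the per-step values yourself (TensorBoard's smoothing slider does essentially this). A minimal sketch, assuming you have collected the per-step losses into an array:

    import numpy as np

    def moving_average(losses, window=100):
        # Average each window of steps so the trend becomes visible.
        kernel = np.ones(window) / window
        return np.convolve(losses, kernel, mode="valid")

    # Made-up noisy-but-decreasing loss curve, just for illustration.
    steps = np.arange(5000)
    losses = 1.0 / (1 + 0.01 * steps) + 0.2 * np.random.rand(5000)
    smoothed = moving_average(losses)  # the downward trend is now apparent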

If it is the overall loss value, then why is it fluctuating so much? Am I missing something?

A common reason for a neural network not learning is a poor choice of hyperparameters (though there are many other mistakes you can make). For example:

  • the learning rate is too large (see the sketch after this list)
  • it's also possible that the learning rate is too small, meaning the network is learning, but very, very slowly, so you can't see it
  • the weight initialization is probably too large; try to decrease it
  • the batch size may be too large as well
  • you're passing the wrong labels for the inputs
  • the training data contains missing values, or is unnormalized
  • ...
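As an example of the first point, here is a minimal sketch of passing an explicit optimizer with a smaller learning rate to the estimator (the feature column and values are made-up assumptions, TF 1.x API):

    import tensorflow as tf

    # Hypothetical feature column; substitute your own.
    feature_columns = [tf.feature_column.numeric_column("x")]

    estimator = tf.estimator.DNNRegressor(
        feature_columns=feature_columns,
        hidden_units=[32, 64, 32],
        # An explicit optimizer gives direct control over the learning
        # rate; the canned default is Adagrad with learning_rate=0.05.
        optimizer=tf.train.AdagradOptimizer(learning_rate=0.005))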

What I usually do to check whether a neural network is at least somehow working is to reduce the training set to a few examples and try to overfit the network. This experiment is very fast, so I can try various learning rates, initialization variances and other parameters to find a sweet spot. Once I have a steadily decreasing loss chart, I move on to a bigger set.
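A minimal sketch of that sanity check, using numpy_input_fn over a handful of made-up rows (all names and values are illustrative):

    import numpy as np
    import tensorflow as tf

    # A tiny synthetic dataset; a healthy network should be able to
    # drive the training loss on these 16 rows close to zero.
    tiny_x = np.random.rand(16, 1).astype(np.float32)
    tiny_y = (3.0 * tiny_x[:, 0] + 1.0).astype(np.float32)

    input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": tiny_x}, y=tiny_y,
        batch_size=16, num_epochs=None, shuffle=True)

    estimator = tf.estimator.DNNRegressor(
        feature_columns=[tf.feature_column.numeric_column("x")],
        hidden_units=[32, 32])

    estimator.train(input_fn=input_fn, steps=2000)
    # If the reported loss does not approach zero even here, tune the
    # hyperparameters before returning to the full dataset.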

  • Thanks for sharing; it was very helpful for going forward. I had already figured out that DNNRegressor uses the mean_squared_loss function, and by the way, I am using tf.estimator.DNNRegressor for my model. I have one more doubt: does specifying the batch_size parameter in the input function have any effect on the accuracy achieved? I get the feeling that when I use batch_size = "input_data_set_size", my algorithm converges much better than with 10 or 100 as my batch_size. Any suggestions on that? – user3457384 Oct 05 '17 at 11:14
  • Batch size is an important hyperparameter and it can affect performance. Usually researchers set it as large as possible so that it fits in GPU memory, but there are known cases when smaller batches make a DNN learn faster. In general, it's one of many parameters you might want to tune for optimal performance - https://stackoverflow.com/questions/41860817/hyperparameter-optimization-for-deep-learning-structures-using-bayesian-optimiza/46318446 – Maxim Oct 05 '17 at 11:37
  • Thanks for the response. Another doubt I am having: is normalization necessary for a NN? I mean, will it give wrong answers or lower accuracy if numerical inputs are not normalized? What I have read so far is that unnormalized inputs will take longer to converge, but will never give wrong answers or lower accuracy. Am I thinking correctly or not? – user3457384 Oct 05 '17 at 13:41
  • It may take a longer time to converge, or not converge at all. Until the network is trained, it may give wrong results, and thus worse accuracy. But you have a fixed time to train it, so you are only interested in cases where the network learns relatively fast. So yes, normalization is important, especially in linear regression. – Maxim Oct 05 '17 at 13:48
  • Earlier I used to run this program on a CPU, but it was taking more time for large computations. Now I have rented a GPU-enabled server on AWS (p2.xlarge). The problem I'm facing now is that my GPU memory is being used, but Volatile GPU-Util is still at 1%, and I'm not able to figure out whether my GPU is being used to the fullest or not. PS: I'm using the same Estimator code on the GPU that I was using on the CPU, without any modification, as I read that Estimators themselves take care of the GPU. Please help! – user3457384 Oct 09 '17 at 12:33
  • @user3457384 this is a different topic that needs more details and a separate discussion. Write another question with some diagnostics - https://serverfault.com/questions/395455/how-to-check-gpu-usages-on-aws-ec2-gpu-instance - I or somebody else will take a look at it – Maxim Oct 10 '17 at 09:29
  • I have already written a question for this: https://stackoverflow.com/questions/46648484/how-to-make-best-use-of-gpu-for-tensorflow-estimators and explained every detail in it. Please have a look. – user3457384 Oct 10 '17 at 09:51

Though the previous answer is very informative and good, it doesn't quite address your issue. When you instantiate the DNNRegressor, add loss_reduction=tf.losses.Reduction.MEAN to the constructor, and you'll see your average loss converge:

    estimator = tf.estimator.DNNRegressor(
        feature_columns=feat_clmns,
        hidden_units=[32, 64, 32],
        weight_column=weight_clmn,
        loss_reduction=tf.losses.Reduction.MEAN)
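For context: the canned estimators default to tf.losses.Reduction.SUM, so the reported loss is summed over the batch and its raw value scales with the batch size. MEAN divides by the batch size instead, giving a per-example loss that is easier to compare across steps.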