
Summary

I am building a classifier for spam vs. ham emails using Octave and the Ling-Spam corpus; my method of classification is logistic regression.

Higher learning rates lead to NaN values being calculated for the cost, yet this does not break or degrade the performance of the classifier itself.

My Attempts

NB: My dataset is already normalised using mean normalisation. When choosing my learning rate, I started with 0.1 and 400 iterations. This resulted in the following plot:

[Graph 1]

Where the lines completely disappear after a few iterations, it is because a NaN value has been produced; I thought this would result in broken parameter values and thus bad accuracy, but when checking the accuracy, I saw it was 95% on the test set (meaning that gradient descent was apparently still functioning). I checked different values of the learning rate and iterations to see how the graphs changed:

[Graph 2]

The lines no longer disappeared, meaning no NaN values, BUT the accuracy was 87%, which is substantially lower.

I did two more tests with more iterations and a slightly higher learning rate; in both of them, the cost curves decreased with iterations as expected, but the accuracy was ~86-88%. No NaNs there either.

I realised that my dataset was skewed, with only 481 spam emails against 2412 ham emails. I therefore calculated the FScore for each of these combinations, hoping to find that the later ones had a higher FScore and that the first result's high accuracy was due to the skew. That was not the case either - I have summed up my results in a table:

[Table of results]

So there is no overfitting and the skew does not seem to be the problem; I don't know what to do now!

The only thing I can think of is that my calculations for accuracy and FScore are wrong, or that my initial diagnosis of the lines 'disappearing' was wrong.

EDIT: This question is crucially about why the NaN values occur for those chosen learning rates. The temporary fix of lowering the learning rate did not really answer my question - I had always thought that higher learning rates simply caused divergence instead of convergence, not the production of NaN values.

My Code

My main.m code (bar getting the dataset from files):

numRecords = length(labels);

trainingSize = ceil(numRecords*0.6);
CVSize = trainingSize + ceil(numRecords*0.2);

featureData = normalise(data);

featureData = [ones(numRecords, 1), featureData];

numFeatures = size(featureData, 2);

featuresTrain = featureData(1:(trainingSize-1),:);
featuresCV = featureData(trainingSize:(CVSize-1),:);
featuresTest = featureData(CVSize:numRecords,:);

labelsTrain = labels(1:(trainingSize-1),:);
labelsCV = labels(trainingSize:(CVSize-1),:);
labelsTest = labels(CVSize:numRecords,:);

paramStart = zeros(numFeatures, 1);

learningRate = 0.0001;
iterations = 400;

[params] = gradDescent(featuresTrain, labelsTrain, learningRate, iterations, paramStart, featuresCV, labelsCV);

threshold = 0.5;
[accuracy, precision, recall] = predict(featuresTest, labelsTest, params, threshold);
fScore = (2*precision*recall)/(precision+recall);
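(One defensive tweak, my addition rather than part of the original run: if the classifier produces no true positives, precision + recall is 0 and the FScore expression above evaluates 0/0, giving NaN. It can be guarded:)

```octave
% Guard against 0/0 when precision and recall are both zero
if (precision + recall) == 0
  fScore = 0;
else
  fScore = (2*precision*recall)/(precision+recall);
endif
```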

My gradDescent.m code:

function [optimParams] = gradDescent(features, labels, learningRate, iterations, paramStart, featuresCV, labelsCV)

x_axis = [];
J_axis = [];
J_CV = [];

params = paramStart;

for i=1:iterations,
  [cost, grad] = costFunction(features, labels, params);
  [cost_CV] = costFunction(featuresCV, labelsCV, params);

  params = params - (learningRate.*grad);

  x_axis = [x_axis;i];
  J_axis = [J_axis;cost];
  J_CV = [J_CV;cost_CV];
endfor

graphics_toolkit("gnuplot")
plot(x_axis, J_axis, 'r', x_axis, J_CV, 'b');
legend("Training", "Cross-Validation");
xlabel("Iterations");
ylabel("Cost");
title("Cost as a function of iterations");

optimParams = params;
endfunction

My costFunction.m code:

function [cost, grad] = costFunction(features, labels, params)
  numRecords = length(labels);

  hypothesis = sigmoid(features*params);

  cost = (-1/numRecords)*sum((labels).*log(hypothesis)+(1-labels).*log(1-hypothesis));

  grad = (1/numRecords)*(features'*(hypothesis-labels));
endfunction
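As an aside, the cross-entropy above can be rewritten so that it never evaluates log() at 0. This is a hypothetical alternative formulation, not the code I used: for hypothesis = sigmoid(z), the loss -y.*log(h) - (1-y).*log(1-h) is algebraically equal to log(1 + exp(z)) - y.*z, which has a numerically stable form:

```octave
% Hypothetical numerically stable cost (a sketch, not the original code).
% Uses the identity log(1 + exp(z)) = max(z, 0) + log1p(exp(-abs(z))),
% which never overflows and never takes the log of 0.
function cost = stableCostFunction(features, labels, params)
  z = features * params;
  numRecords = length(labels);
  cost = (1/numRecords) * sum(max(z, 0) - labels.*z + log1p(exp(-abs(z))));
endfunction
```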

My predict.m code:

function [accuracy, precision, recall] = predict(features, labels, params, threshold)
numRecords=length(labels);

predictions = sigmoid(features*params)>threshold;

correct = predictions == labels;

% (predictions == labels == 1) evaluates left-to-right and counts ALL
% correct predictions, not just true positives, so use explicit masks:
truePositives = sum((predictions == 1) & (labels == 1));
falsePositives = sum((predictions == 1) & (labels == 0));
falseNegatives = sum((predictions == 0) & (labels == 1));

precision = truePositives/(truePositives+falsePositives);
recall = truePositives/(truePositives+falseNegatives);

accuracy = 100*(sum(correct)/numRecords);
endfunction
  • Just to nitpick, this is **not** a SVM with a linear kernel. You are performing logistic regression. – rayryeng Feb 11 '19 at 01:53
  • Ah, thanks for the nitpick! I must confess I'm not too sure about the differences between the two, so I appreciate the correction. I have edited my answer to reflect that :). Feel free to edit if there are any other things that I have missed/gotten wrong. – Zac G Feb 11 '19 at 20:47
  • No problem. The differences are beyond this scope, but the crucial difference is how logistic regression and SVMs handle the case with perfectly separable data. Logistic regression fails to converge whereas SVMs find the most optimal hyperplane and thus the best support vectors to do so. In addition, the cost function between both methods is different. SVMs require constrained optimization using quadratic programming to minimize the cost function whereas logistic regression can use simple gradient descent. The one you are optimizing is binary cross-entropy, not the SVM with a linear kernel. – rayryeng Feb 11 '19 at 20:53
  • I've also noticed that you have decreased the learning rate while increasing the number of iterations noting the decrease in accuracy. Have you tried keeping the learning rate the same while increasing the number of iterations? It seems that you're changing two hyperparameters here simultaneously so it's unclear as to what the root cause is in this decreased accuracy. Also, the "disappearing" is most likely due to the cost function producing `NaN` values. This can happen if the probabilities produced in the classification (i.e. the sigmoid values) are close to zero (i.e. `log(0)`) (continued). – rayryeng Feb 11 '19 at 21:33
  • You can check to see if you have `NaN` values in your cost function values by seeing if either one of these is true: `any(isnan(J_axis))`, `any(isnan(J_CV))`. If any of them evaluates to `true`, then it's a problem with how the cost function is being evaluated. Please see this post on how to modify the cross-entropy cost function to avoid `NaN` values: https://stackoverflow.com/questions/35419882/cost-function-in-logistic-regression-gives-nan-as-a-result/35422981#35422981 – rayryeng Feb 11 '19 at 21:37
  • @rayryeng You're right about there being NaN values being produced from the cost function, I completely forgot to include that in my question (silly me), I'll edit it to include the NaN. I'll check out the question/answer too. – Zac G Feb 12 '19 at 10:55
  • Possible duplicate of [Cost function in logistic regression gives NaN as a result](https://stackoverflow.com/questions/35419882/cost-function-in-logistic-regression-gives-nan-as-a-result) – rayryeng Feb 13 '19 at 14:23

1 Answer


Credit where it's due:

A big help here was this answer: https://stackoverflow.com/a/51896895/8959704 - so this question is effectively a duplicate, but I didn't realise it, and it isn't obvious at first. I will do my best to explain why the solution works too, to avoid simply copying the answer.

Solution:

The issue was in fact the 0*log(0) = NaN result that occurred in my data. To fix it, my calculation of the cost became:

cost = (-1/numRecords)*sum((labels).*log(hypothesis)+(1-labels).*log(1-hypothesis+eps(numRecords, 1)));

(see the question for the variables' values etc., it seems redundant to include the rest when just this line changes)
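A minimal sketch of the failure (illustrative numbers, not my actual data): with a high learning rate the parameters grow large, features*params makes sigmoid() saturate to exactly 1 in double precision, and the (1-labels).*log(1-hypothesis) term becomes 0*log(0):

```octave
h = 1/(1 + exp(-40))  % sigmoid saturates: 1 + exp(-40) rounds to 1, so h == 1
log(1 - h)            % log(0) = -Inf
0 * log(1 - h)        % 0 * -Inf = NaN, which then poisons the summed cost
```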

Explanation:

The eps() function is defined as follows:

Return a scalar, matrix or N-dimensional array whose elements are all eps, the machine precision.

More precisely, eps is the relative spacing between any two adjacent numbers in the machine’s floating point system. This number is obviously system dependent. On machines that support IEEE floating point arithmetic, eps is approximately 2.2204e-16 for double precision and 1.1921e-07 for single precision.

When called with more than one argument the first two arguments are taken as the number of rows and columns and any further arguments specify additional matrix dimensions. The optional argument class specifies the return type and may be either "double" or "single".

So this means that adding this value to 1 - hypothesis (which was previously so close to 0 that it was computed as exactly 0) makes it the closest representable value to 0 that is not 0, so log() no longer returns -Inf.
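In other words (a quick illustration of the fix):

```octave
log(0)            % -Inf
log(0 + eps)      % about -36.04: very negative, but finite
0 * log(0 + eps)  % 0, so the cost stays a real number
```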

When testing with the learning rate as 0.1 and iterations as 2000/1000/400, the full graph was plotted and no NaN values were produced when checking.

[Graph 1, after the fix]
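A minimal way to verify this, following rayryeng's suggestion in the comments (assuming the J_axis and J_CV vectors from gradDescent are in scope):

```octave
% Both should return 0 (false) if no NaN costs were recorded
any(isnan(J_axis))
any(isnan(J_CV))
```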

NB: Just in case anyone was wondering, the accuracy and FScores did not change after this, so the accuracy really was that good despite the error in calculating the cost with a higher learning rate.
