Summary
I am building a classifier for spam vs. ham emails using Octave and the Ling-Spam corpus; my method of classification is logistic regression.
Higher learning rates lead to NaN values being calculated for the cost, yet it does not break/decrease the performance of the classifier itself.
My Attempts
NB: My dataset is already normalised using mean normalisation. When trying to choose my learning rate, I started with it as 0.1 and 400 iterations. This resulted in the following plot:
1 - Graph 1
When he lines completely disappear after a few iterations, it is due to a NaN value being produced; I thought this would result in broken parameter values and thus bad accuracy, but when checking the accuracy, I saw it was 95% on the test set (meaning that gradient descent was apparently still functioning). I checked different values of the learning rate and iterations to see how the graphs changed:
2 - Graph 2
The lines no longer disappeared, meaning no NaN values, BUT the accuracy was 87% which is substantially lower.
I did two more tests with more iterations and a slightly higher learning rate, and in both of them, the graphs both decreased with iterations as expected, but the accuracy was ~86-88%. No NaNs there either.
I realised that my dataset was skewed, with only 481 spam emails and 2412 ham emails. I therefore calculated the FScore for each of these different combinations, hoping to find the later ones had a higher FScore and the accuracy was due to the skew. That was not the case either - I have summed up my results in a table:
3 - Table
So there is no overfitting and the skew does not seem to be the problem; I don't know what to do now!
The only thing I can think of is that my calculations for accuracy and FScore are wrong, or that my initial debugging of the line 'disappearing' was wrong.
EDIT: This question is crucially about why the NaN values occur for those chosen learning rates. So the temporary fix I had of lowering the learning rate did not really answer my question - I always thought that higher learning rates simply diverged instead of converging, not producing NaN values.
My Code
My main.m code (bar getting the dataset from files):
numRecords = length(labels);
trainingSize = ceil(numRecords*0.6);
CVSize = trainingSize + ceil(numRecords*0.2);
featureData = normalise(data);
featureData = [ones(numRecords, 1), featureData];
numFeatures = size(featureData, 2);
featuresTrain = featureData(1:(trainingSize-1),:);
featuresCV = featureData(trainingSize:(CVSize-1),:);
featuresTest = featureData(CVSize:numRecords,:);
labelsTrain = labels(1:(trainingSize-1),:);
labelsCV = labels(trainingSize:(CVSize-1),:);
labelsTest = labels(CVSize:numRecords,:);
paramStart = zeros(numFeatures, 1);
learningRate = 0.0001;
iterations = 400;
[params] = gradDescent(featuresTrain, labelsTrain, learningRate, iterations, paramStart, featuresCV, labelsCV);
threshold = 0.5;
[accuracy, precision, recall] = predict(featuresTest, labelsTest, params, threshold);
fScore = (2*precision*recall)/(precision+recall);
My gradDescent.m code:
function [optimParams] = gradDescent(features, labels, learningRate, iterations, paramStart, featuresCV, labelsCV)
x_axis = [];
J_axis = [];
J_CV = [];
params = paramStart;
for i=1:iterations,
[cost, grad] = costFunction(features, labels, params);
[cost_CV] = costFunction(featuresCV, labelsCV, params);
params = params - (learningRate.*grad);
x_axis = [x_axis;i];
J_axis = [J_axis;cost];
J_CV = [J_CV;cost_CV];
endfor
graphics_toolkit("gnuplot")
plot(x_axis, J_axis, 'r', x_axis, J_CV, 'b');
legend("Training", "Cross-Validation");
xlabel("Iterations");
ylabel("Cost");
title("Cost as a function of iterations");
optimParams = params;
endfunction
My costFunction.m code:
function [cost, grad] = costFunction(features, labels, params)
numRecords = length(labels);
hypothesis = sigmoid(features*params);
cost = (-1/numRecords)*sum((labels).*log(hypothesis)+(1-labels).*log(1-hypothesis));
grad = (1/numRecords)*(features'*(hypothesis-labels));
endfunction
My predict.m code:
function [accuracy, precision, recall] = predict(features, labels, params, threshold)
numRecords=length(labels);
predictions = sigmoid(features*params)>threshold;
correct = predictions == labels;
truePositives = sum(predictions == labels == 1);
falsePositives = sum((predictions == 1) != labels);
falseNegatives = sum((predictions == 0) != labels);
precision = truePositives/(truePositives+falsePositives);
recall = truePositives/(truePositives+falseNegatives);
accuracy = 100*(sum(correct)/numRecords);
endfunction