I am making a neural network from scratch in C++ for a digit-classification task on the MNIST dataset. The network has one hidden layer of 100 nodes and an output layer of 10 nodes, one per digit (0 to 9). I train it with backpropagation, using the sigmoid activation function and the Mean Squared Error (MSE) cost, computed per output node as (actualNodeActivation - expectedNodeActivation)^2.
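For context, the activation and cost helpers referenced in the code further down behave essentially like this (a simplified sketch: my real versions are Layer members, and `sigmoid` here is a free function):

```cpp
#include <cmath>

// Sigmoid squashes the weighted input z into (0, 1).
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Takes the *activation* a (not z), since sigma'(z) = a * (1 - a)
// when a = sigma(z).
double activationSigmoidDerivative(double a) { return a * (1.0 - a); }

// Derivative of (a - y)^2 with respect to the activation a.
double calculateCostDerivative(double a, double y) { return 2.0 * (a - y); }
```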
Regardless of the input data, my network's output activations all converge to around 0.1. This seemed very random at first, but I think I understand what is actually happening: the expected activation of an output node is either 0 or 1, and since all ten digits are equally likely to come up in a training batch, the average expected activation of any output node is 0.1. The annoying thing is that predicting 0.1 everywhere genuinely lowers the total cost of the network, so it thinks it's doing the right thing.
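The arithmetic behind that claim is easy to check (a toy illustration, not part of my network; `averageTargetForNode` is made up for this example):

```cpp
// Over a balanced batch of the 10 digits, each output node's one-hot
// target is 1 exactly once and 0 otherwise, so its average target
// activation is 1/10 = 0.1.
double averageTargetForNode(int node) {
    double sum = 0.0;
    for (int digit = 0; digit < 10; digit++)
        sum += (digit == node) ? 1.0 : 0.0;  // one-hot target for this digit
    return sum / 10.0;
}
```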
Things I have already checked:
- The input images are normalized to pixel values between 0 and 1.
- The target labels (expectedNodeActivation) are correctly one-hot encoded for each digit class.
- Different learning rates make no difference.
- The network architecture seems reasonable, and the implementation appears correct as far as I can tell.
I know there are a lot of things I could change, such as using ReLU instead of sigmoid, using softmax for the output layer, or switching the cost function, but from what I understand, what I have now should still work. Sure, the accuracy of the network wouldn't be the greatest, but this is not the behaviour I would expect to see.
Here is all the relevant code for backpropagation:
void NeuralNetwork::backPropagation(std::vector<double> inputs, std::vector<double> expectedOutputs) {
    // Run the inputs through the network
    calculateOutputs(std::move(inputs));
    // Update the gradients of the output layer
    std::vector<double> gradientProducts = outputLayer().outputLayerGradientProduct(std::move(expectedOutputs));
    outputLayer().calculateGradients(gradientProducts);
    // Calculate the gradients for each of the hidden layers
    for (int layer = layers.size() - 2; layer >= 0; layer--) {
        gradientProducts = layers[layer].hiddenLayerGradientProduct(layers[layer + 1], gradientProducts);
        layers[layer].calculateGradients(gradientProducts);
    }
}
std::vector<double> Layer::outputLayerGradientProduct(std::vector<double> expectedOutputs) {
    std::vector<double> gradientProducts(length());
    for (int node = 0; node < length(); node++) {
        // Evaluate partial derivatives for current node: cost/activation * activation/weightedInput
        gradientProducts[node] = activationSigmoidDerivative(activations[node])
            * calculateCostDerivative(activations[node], expectedOutputs[node]);
    }
    return gradientProducts;
}
std::vector<double> Layer::hiddenLayerGradientProduct(Layer oldLayer, std::vector<double> oldGradientProducts) {
    std::vector<double> gradientProducts(length());
    for (int newGradientIndex = 0; newGradientIndex < gradientProducts.size(); newGradientIndex++) {
        double gradientProductValue = 0;
        for (int oldGradientIndex = 0; oldGradientIndex < oldGradientProducts.size(); oldGradientIndex++) {
            // Partial derivative of the weighted input with respect to the input
            gradientProductValue += oldLayer.weights[newGradientIndex][oldGradientIndex]
                * oldGradientProducts[oldGradientIndex];
        }
        gradientProducts[newGradientIndex] = activationSigmoidDerivative(gradientProductValue);
    }
    return gradientProducts;
}
void Layer::calculateGradients(std::vector<double> gradientProducts) {
    for (int nodeOut = 0; nodeOut < numNodesOut; nodeOut++) {
        for (int nodeIn = 0; nodeIn < numNodesIn; nodeIn++) {
            costGradientW[nodeIn][nodeOut] += inputs[nodeIn] * gradientProducts[nodeOut];
        }
        costGradientB[nodeOut] += gradientProducts[nodeOut];
    }
}
// Note: I have omitted the function that calls applyGradients. The learnRate
// passed in here is not the raw learn rate, but the learn rate divided by the
// size of the training batch, so the accumulated gradients are averaged.
void Layer::applyGradients(double learnRate) {
    for (int nodeOut = 0; nodeOut < numNodesOut; nodeOut++) {
        biases[nodeOut] -= costGradientB[nodeOut] * learnRate;
        costGradientB[nodeOut] = 0;
        for (int nodeIn = 0; nodeIn < numNodesIn; nodeIn++) {
            weights[nodeIn][nodeOut] -= costGradientW[nodeIn][nodeOut] * learnRate;
            costGradientW[nodeIn][nodeOut] = 0;
        }
    }
}
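To make the averaging concrete: since the gradients are summed over the batch, dividing the learn rate by the batch size gives the same update as averaging the gradients first. A standalone sketch of that equivalence (`applyAveragedGradient` is made up for this example, not part of my code):

```cpp
// Update a single weight from a gradient accumulated over a batch.
// Dividing learnRate by batchSize turns the accumulated sum into an average.
double applyAveragedGradient(double weight, double accumulatedGradient,
                             double learnRate, int batchSize) {
    return weight - accumulatedGradient * (learnRate / batchSize);
}
```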
How can I fix this?