
I am trying to implement mini-batch training in my neural network instead of the "online" stochastic method of updating the weights after every training sample.

I have developed a somewhat novice neural network in C in which I can adjust the number of neurons in each layer, the activation functions, etc. This is to help me understand neural networks. I have trained the network on the MNIST data set, but it takes around 200 epochs to get down to an error rate of 20% on the training set, which seems very poor to me. I am currently using online stochastic gradient descent to train the network. What I would like to try is mini-batches instead. I understand the concept that I must accumulate and average the error from each training sample before I propagate the error back. My problem comes in when I want to calculate the changes I must make to the weights. To explain this better, consider a very simple perceptron model: one input, one hidden layer, one output. To calculate the change I need to make to the weight between the input and the hidden unit, I will use the following equation:

∂C/∂w1= ∂C/∂O*∂O/∂h*∂h/∂w1

If you do the partial derivatives you get:

∂C/∂w1= (Output-Expected Answer)(w2)(input)

Now this formula says that you need to multiply the backpropagated error by the input. For online stochastic training that makes sense, because you use one input per weight update. For mini-batch training you use many inputs, so which input does the error get multiplied by? I hope you can assist me with this.

void propogateBack(void){


    //calculate 6C/6G
    for (count=0;count<network.outputs;count++){
            network.g_error[count] = derive_cost((training.answer[training_current])-(network.g[count]));
    }



    //calculate 6G/6O
    for (count=0;count<network.outputs;count++){
        network.o_error[count] = derive_activation(network.g[count])*(network.g_error[count]);
    }


    //calculate 6O/6S3
    for (count=0;count<network.h3_neurons;count++){
        network.s3_error[count] = 0;
        for (count2=0;count2<network.outputs;count2++){
            network.s3_error[count] += (network.w4[count2][count])*(network.o_error[count2]);
        }
    }


    //calculate 6S3/6H3
    for (count=0;count<network.h3_neurons;count++){
        network.h3_error[count] = (derive_activation(network.s3[count]))*(network.s3_error[count]);
    }


    //calculate 6H3/6S2
    for (count=0;count<network.h2_neurons;count++){
        network.s2_error[count] = 0;
        for (count2=0;count2<network.h3_neurons;count2++){ 
            network.s2_error[count] += (network.w3[count2][count])*(network.h3_error[count2]);
        }
    }



    //calculate 6S2/6H2
    for (count=0;count<network.h2_neurons;count++){
        network.h2_error[count] = (derive_activation(network.s2[count]))*(network.s2_error[count]);
    }


    //calculate 6H2/6S1
    for (count=0;count<network.h1_neurons;count++){
        network.s1_error[count] = 0;
        for (count2=0;count2<network.h2_neurons;count2++){
            network.s1_error[count] += (network.w2[count2][count])*(network.h2_error[count2]);
        }
    }


    //calculate 6S1/6H1
    for (count=0;count<network.h1_neurons;count++){
        network.h1_error[count] = (derive_activation(network.s1[count]))*(network.s1_error[count]);

    }


}





void updateWeights(void){


    //////////////////w1
    for(count=0;count<network.h1_neurons;count++){
        for(count2=0;count2<network.inputs;count2++){
            network.w1[count][count2] -= learning_rate*(network.h1_error[count]*network.input[count2]);
        }

    }





    //////////////////w2
    for(count=0;count<network.h2_neurons;count++){
        for(count2=0;count2<network.h1_neurons;count2++){
            network.w2[count][count2] -= learning_rate*(network.h2_error[count]*network.s1[count2]);
        }

    }



    //////////////////w3
    for(count=0;count<network.h3_neurons;count++){
        for(count2=0;count2<network.h2_neurons;count2++){
            network.w3[count][count2] -= learning_rate*(network.h3_error[count]*network.s2[count2]);
        }

    }


    //////////////////w4
    for(count=0;count<network.outputs;count++){
        for(count2=0;count2<network.h3_neurons;count2++){
            network.w4[count][count2] -= learning_rate*(network.o_error[count]*network.s3[count2]);
        }

    }
}

The code I have attached is how I do the online stochastic updates. As you can see in the updateWeights() function, the weight updates are based on the input values (dependent on the sample fed in) and the hidden unit values (also dependent on the input sample fed in). So when I have the mini-batch average gradient that I am propagating back, how will I update the weights? Which input values do I use?

bruno
C Geeeee

2 Answers


OK, so I figured it out. When using mini-batches you should not accumulate and average the error at the output of the network. Each training example's error gets propagated back as it normally would, except that instead of updating the weights, you accumulate the changes you would have made to each weight. When you have looped through the mini-batch, you then average the accumulations and change the weights accordingly.

I was under the impression that when using mini-batches you do not have to propagate any error back until you have looped through the mini-batch. I was wrong: you still need to do that. The only difference is that you only update the weights once you have looped through your mini-batch.
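A minimal sketch of this accumulate-then-average scheme for a single weight matrix might look like the following. The names, sizes, and two-function split here are hypothetical illustrations, not the structs from the question:

```c
#include <stddef.h>

#define BATCH_SIZE 32

/* Hypothetical layer dimensions; the real sizes come from the network. */
#define N_OUT 10
#define N_IN  20

static double w[N_OUT][N_IN];        /* weights                          */
static double grad_acc[N_OUT][N_IN]; /* accumulated per-sample gradients */

/* Called once per training sample, right after its backward pass:
 * accumulate this sample's gradient instead of updating w directly. */
void accumulateGradients(const double error[N_OUT], const double input[N_IN])
{
    for (size_t i = 0; i < N_OUT; i++)
        for (size_t j = 0; j < N_IN; j++)
            grad_acc[i][j] += error[i] * input[j];
}

/* Called once per mini-batch: apply the averaged gradient, then reset. */
void applyMiniBatchUpdate(double learning_rate)
{
    for (size_t i = 0; i < N_OUT; i++)
        for (size_t j = 0; j < N_IN; j++) {
            w[i][j] -= learning_rate * (grad_acc[i][j] / BATCH_SIZE);
            grad_acc[i][j] = 0.0;
        }
}
```

Each sample's `error * input` product uses that sample's own input, which answers the original question: every input is paired with its own backpropagated error, and only the resulting weight changes are averaged.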

C Geeeee
  • I currently have the same problem, and what you said makes things more logical. What I don't get is this: if you have to calculate the gradients for each sample anyway, how is it possible that mini-batch is more performant than online backpropagation? – Andrea Catania May 25 '19 at 07:28
  • Andrea Catania, very good question to ask. I had the exact same question. There are two performance benefits when using mini-batches. The first is that you can forward propagate and back propagate each training sample independently of the others. For example, when you are doing online stochastic training, one sample gets propagated forward, then gradients are propagated backwards, then weights are updated. Only once the weights are updated can you begin to propagate the next training sample forward. This limits you to training one sample at a time, which is a serial process. – C Geeeee May 26 '19 at 08:41
  • 2
    With mini batches, no weights are updated until all the samples in the mini batch have been propogated forward and gradients have been propogated back (weights not updated) so in this instance the state of your network (weight values) does not change until all mini batch samples have been processed. This allows you to run as many samples as you want or can in parallel using multiple CPU or GPU threads. The forward and backwards passes become independent of one another as the network state remains the same. – C Geeeee May 26 '19 at 08:44
  • 2
    The other performance benefit is that you are only updating your weights once every N times. N being the size of your mini batch. As N becomes larger so does the performance benefit – C Geeeee May 26 '19 at 08:45
  • Perfect and really clear explanation! This post is very important. Many thanks man. – Andrea Catania May 26 '19 at 09:47

For minibatch training you used many inputs so which input does the error get multiplied by?

"Many inputs": this is a fraction of the dataset size N. Mini-batching typically segments your data into chunks that are small enough to fit into memory. Deep learning needs big data, and the full batch cannot fit into most computer systems to process in one go, so the mini-batch is necessary.

The error which gets backpropagated is the sum or average error calculated over the data samples in your current mini-batch $X^{t}$, which is of size $M$ where $M < N$: $J^{t} = \frac{1}{M} \sum_{m=1}^{M} \left( f(x_m^{t}) - y_m^{t} \right)^2$. This is the mean of the squared distances to the targets across the samples in batch $t$. This is the forward step; the backward propagation of this error is then made using the chain rule through the 'neurons' of the network, using this single value of the error for the whole batch. The update of the parameters is based upon this value for this mini-batch.
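As an illustration of that batch loss, a direct translation into C could look like this (the array names `f_x` and `y` are made up for the example, not taken from the question's code):

```c
#include <stddef.h>

/* Mean squared error over one mini-batch of M samples:
 * f_x[m] holds the network output f(x_m) and y[m] the target y_m. */
double miniBatchLoss(const double *f_x, const double *y, size_t M)
{
    double sum = 0.0;
    for (size_t m = 0; m < M; m++) {
        double d = f_x[m] - y[m];
        sum += d * d;   /* squared distance to the target for sample m */
    }
    return sum / (double)M;
}
```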

There are variations in how this scheme is implemented, but if you consider your idea of using "many inputs" in the calculation of the parameter update, i.e. using multiple input samples from the batch, we are averaging over multiple per-sample gradients to smooth the gradient estimate in comparison to stochastic gradient descent.

Vass