
After going through the Caffe tutorial here: http://caffe.berkeleyvision.org/gathered/examples/mnist.html

I am really confused by the different (and efficient) model used in this tutorial, which is defined here: https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt

As I understand it, a convolutional layer in Caffe simply computes Wx+b for each input, without applying any activation function. If we would like to add an activation function, we should add another layer immediately after that convolutional layer, such as a Sigmoid, Tanh, or ReLU layer. Every paper/tutorial I have read on the internet applies an activation function to the neuron units.

This leaves me with a big question mark, since all we can see in the model is convolutional layers and pooling layers interleaved. I hope someone can give me an explanation.

As a side note, another doubt of mine is the max_iter in this solver: https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_solver.prototxt

We have 60,000 images for training and 10,000 images for testing. So why is max_iter here only 10,000 (and yet it still reaches a >99% accuracy rate)? What does Caffe do in each iteration? Also, I'm not entirely sure whether the accuracy rate is total correct predictions / test set size.

I'm very impressed by this example, as I haven't found any other example or framework that can achieve such a high accuracy rate in such a short time (only 5 minutes to reach >99% accuracy). Hence, I suspect there is something I have misunderstood.

Thanks.

hosyvietanh
  • The identity y=x can also be considered an activation function, with a derivative equal to 1. Such an activation layer would just copy values in the forward pass and multiply values by 1 in the backward pass, so it can be omitted. You can use pretty much any monotone function as the activation function in the backpropagation algorithm. – Ivan Kuckir Jan 07 '20 at 17:39

2 Answers


Caffe uses batch processing: each iteration processes one mini-batch and performs one weight update. max_iter is 10,000 because the batch_size is 64. Number of epochs = (batch_size × max_iter) / number of training samples, which works out to about 10.7 epochs here. The accuracy is calculated on the test data, and yes, the model's accuracy is indeed >99%, as the dataset is not very complicated.
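Spelled out as a quick sketch (using the values quoted above: `batch_size: 64` from the TRAIN data layer of `lenet_train_test.prototxt` and `max_iter: 10000` from `lenet_solver.prototxt`):

```
# Relevant solver setting (lenet_solver.prototxt):
max_iter: 10000     # one iteration = one forward/backward pass over one mini-batch
# Relevant data-layer setting (lenet_train_test.prototxt, TRAIN phase):
#   batch_size: 64
#
# images processed during training = batch_size * max_iter = 64 * 10000 = 640,000
# epochs = 640,000 / 60,000 training images ≈ 10.7 passes over the training set
```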

Harsh Wardhan
  • You can read more about `batch_size` and `max_iter` in `solver.prototxt` [here](http://stackoverflow.com/a/33786620/1714410). – Shai Feb 22 '16 at 14:42
  • Hey, can you explain the first problem, about why the Caffe activation layers are not there after the convolutional layers? – hunch Jun 17 '16 at 09:26

For your question about the missing activation layers: you are correct, the model in the tutorial is missing activation layers after its convolution layers. This seems to be an oversight in the tutorial. In the real LeNet-5 model, there should be activation functions following the convolution layers. For MNIST, the model still works surprisingly well without those additional activation layers.

For reference, Le Cun's 2001 paper states:

> As in classical neural networks, units in layers up to F6 compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum, denoted a_i, for unit i, is then passed through a sigmoid squashing function to produce the state of unit i ...

F6 is the "blob" between the two fully connected layers. Hence the first fully connected layer should have an activation function applied (the tutorial uses a ReLU activation function instead of a sigmoid).
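If you wanted to add the missing activations yourself, a minimal sketch of the edit to `lenet_train_test.prototxt` would be a ReLU layer inserted right after each convolution layer, for example after `conv1` (the layer name `relu_conv1` below is just an illustrative choice; setting `bottom` and `top` to the same blob makes it an in-place operation):

```
# Hypothetical addition: an activation immediately after the first convolution.
layer {
  name: "relu_conv1"   # illustrative name, not part of the original prototxt
  type: "ReLU"
  bottom: "conv1"      # take the convolution output ...
  top: "conv1"         # ... and overwrite it in place with max(0, x)
}
```

This is the same pattern the tutorial already uses for the ReLU it places after the first fully connected layer.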

MNIST is the "hello world" example for neural networks. It is very simple by today's standards: a single fully connected layer can solve the problem with an accuracy of about 92%. LeNet-5 is a big improvement over that baseline.

Matthieu
Jonathan