
I am relatively new to deep learning and its frameworks. Currently, I am experimenting with the Caffe framework and trying to fine-tune Vgg16_places_365.

I am using an Amazon EC2 g2.8xlarge instance with 4 GPUs (each with 4 GB of RAM). However, when I try to train my model (using a single GPU), I get this error:

Check failed: error == cudaSuccess (2 vs. 0) out of memory

After doing some research, I found that one way to solve this out-of-memory problem is to reduce the batch size in my train.prototxt:

Caffe | Check failed: error == cudaSuccess (2 vs. 0) out of memory.

Initially, I set the batch size to 50 and iteratively reduced it until training worked at batch_size = 10. Now the model is training, and I am pretty sure it will take quite a long time. However, as a newcomer to this domain, I am curious about the relation between the batch size and other parameters such as the learning rate, stepsize, and even the max iteration that we specify in solver.prototxt.
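For reference, this is roughly what the change looks like in my train.prototxt (the layer name and the LMDB source path here are placeholders; your data layer may differ):

```protobuf
# Data layer in train.prototxt (illustrative; exact layer type and params may differ)
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  data_param {
    source: "train_lmdb"   # placeholder path
    backend: LMDB
    batch_size: 10         # reduced from 50 to fit in 4 GB of GPU memory
  }
}
```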

How significantly will the batch size affect the quality of the model (e.g. its accuracy)? How can the other parameters be used to improve that quality? Also, instead of reducing the batch size or scaling up my machine, is there another way to fix this problem?

bohr

2 Answers


To answer your first question regarding the relationship between parameters such as batch size, learning rate, and maximum number of iterations, you are best off reading about the mathematical background. A good place to start might be this stats.stackexchange question: How large should the batch size be for stochastic gradient descent?. The answer briefly discusses the relation between batch size and learning rate (from your question I assume learning rate = stepsize) and also provides some references for further reading.

To answer your last question: with the dataset you are fine-tuning on and the model (i.e. VGG16) being fixed (i.e. input data of fixed size and a model of fixed size), you will have a hard time avoiding the out-of-memory problem for large batch sizes. However, if you are willing to reduce the input size or the model size, you might be able to use larger batch sizes. Depending on how (and what) exactly you are fine-tuning, reducing the model size may already be achieved by discarding learned layers or reducing the number/size of the fully connected layers.
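As a sketch of the second option, shrinking a fully connected layer in a VGG16 train.prototxt might look like the following (the bottom/top names follow the standard VGG16 definition; the reduced num_output of 1024 is just an example value):

```protobuf
layer {
  name: "fc6-small"        # renamed so Caffe re-initializes it instead of loading the old fc6 weights
  type: "InnerProduct"
  bottom: "pool5"
  top: "fc6"
  inner_product_param {
    num_output: 1024       # down from 4096 in the original VGG16; example value
  }
}
```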

The remaining questions, i.e. how significantly the batch size influences quality/accuracy and how the other parameters influence quality/accuracy, are hard to answer without knowing the concrete problem you are trying to solve. The influence of, e.g., the batch size on the achieved accuracy may depend on various factors, such as the noise in your dataset, the dimensionality of your dataset, the size of your dataset, as well as other parameters such as the learning rate (= stepsize) or momentum. For this sort of question, I recommend the textbook by Goodfellow et al.; chapter 11 in particular provides some general guidelines on choosing these hyperparameters (i.e. batch size, learning rate, etc.).
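To make the batch size/noise relation concrete, here is a small numerical sketch (not Caffe-specific; the synthetic per-example gradients are invented purely for illustration). It shows that the spread of a mini-batch gradient estimate shrinks as the batch grows, roughly like 1/sqrt(batch_size):

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = 1.0
# Pretend each training example contributes a noisy estimate of the true gradient.
per_example = true_grad + rng.normal(0.0, 1.0, size=50_000)

def batch_grad_std(batch_size, n_batches=1000):
    """Std of the mini-batch gradient estimate across many random batches."""
    idx = rng.choice(len(per_example), size=(n_batches, batch_size))
    return per_example[idx].mean(axis=1).std()

print(batch_grad_std(10))   # noisier estimate (smaller batch)
print(batch_grad_std(50))   # smoother estimate (larger batch)
```

A smaller batch thus gives a noisier training signal, which is what the discussion of batch size and convergence above refers to.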

David Stutz
  • awesome answer and thanks for the reference. So, if I understand correctly, in general, the batch size determines how many examples we look at before making a weight update. In other words, the smaller the batch size, the noisier the training signal is going to be, right? If so, does it mean that when we reduce the batch size, we should increase the learning rate? – bohr Sep 03 '16 at 03:11
  • Also, speaking of the two aforementioned alternatives - **reduce the input size or reduce the model size** - I have a little concern about this. As far as I know, the input size for some models is fixed and unique, so modifying it will result in an error. On the other hand, since I am using an existing model like VGG or GoogLeNet, how significant will it be (in terms of computation speed and model quality) if I discard some of the layers or reduce the size of the fully connected layers? – bohr Sep 03 '16 at 03:27
  • Yes, your understanding of the batch size is correct. But reducing the batch size does not necessarily mean increasing the learning rate, or vice versa. For your second question: reducing the input size is not an option if the dataset is fixed (e.g. the image size), but it need not cause an error in the VGG model, given that you retrain (and thus resize) the fully connected layers (only the fully connected layers are restricted to the original input size). Reducing the model size is, for example, possible by reducing the size of the fully connected layers (while keeping the convolutional layers). – David Stutz Sep 03 '16 at 23:10

Another way to solve your problem is to use all the GPUs on your machine. If you have 4x4 = 16 GB of RAM across your GPUs, that would be enough. If you are running Caffe from the command line, just add the --gpu argument as follows (assuming you have 4 GPUs indexed, by default, as 0,1,2,3):

 build/tools/caffe train --solver=solver.prototxt --gpu=0,1,2,3

However, if you are using the Python interface, running with multiple GPUs is not yet supported.

I can point out some general hints to answer your question on the batch size:

- The smaller the batch size, the more stochastic your learning will be --> lower probability of overfitting the training data, but higher probability of not converging.
- Each iteration in Caffe fetches one batch of data, runs a forward pass, and ends with a backpropagation.
- Let's say your training set has 50'000 examples and your batch size is 10; then in 1000 iterations, 10'000 of your examples have been fed to the network. In the same scenario, if your batch size is 50, all your training data are seen by the network in 1000 iterations. This is called one epoch. You should design your batch size and maximum iterations so that your network is trained for a certain number of epochs.
- stepsize in Caffe is the number of iterations your solver runs before multiplying the learning rate by the gamma value (if you have set your training approach to "step").
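The epoch arithmetic and the stepsize rule above can be sketched in a few lines of Python (the 50'000-example dataset size and the base_lr/gamma values are just illustrative numbers, not taken from your setup):

```python
train_size = 50_000  # illustrative dataset size from the example above

def epochs_seen(batch_size, max_iter):
    """Each Caffe iteration consumes one batch, so examples seen = batch_size * max_iter."""
    return batch_size * max_iter / train_size

def step_lr(base_lr, gamma, stepsize, iteration):
    """Caffe "step" lr_policy: lr = base_lr * gamma ^ floor(iteration / stepsize)."""
    return base_lr * gamma ** (iteration // stepsize)

print(epochs_seen(10, 1000))   # 0.2 -> 10'000 of 50'000 examples seen
print(epochs_seen(50, 1000))   # 1.0 -> one full epoch
print(step_lr(0.01, 0.1, stepsize=1000, iteration=2500))  # lr after two step drops
```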

Amir
  • thank you for your answer. Using multiple GPUs would be a good choice; still, I am trying to figure out how to optimize all the parameters so that I can use only one GPU (due to some considerations). Also, thanks so much for your explanation, brief yet informative. However, I have a question: based on your experience, is there any rule of thumb for specifying the hyper-parameters, such as batch size, learning rate, and so forth? – bohr Sep 04 '16 at 14:41
  • Concerning the learning rate, you can find many rules of thumb in the literature: for example, see the VGG paper here: https://arxiv.org/pdf/1409.1556.pdf (Section 3.1), or the Caffe documentation here: http://caffe.berkeleyvision.org/tutorial/solver.html (Section "Rules of thumb for setting the learning rate and momentum"). Also popular are the rules of thumb in Alex Krizhevsky's paper: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf (Section 5). The papers usually also report the batch size used and the reduction scheme of the learning rate. – David Stutz Sep 05 '16 at 10:00
  • awesome!! It will take some time to read all the sources. However, I am trying to run the training using all my GPUs (`caffe train --solver=solver.prototxt --gpu=0,1,2,3`), but now with the GoogLeNet model (in train.prototxt I specify the batch size: 50). Yet, I get the same out-of-memory problem. @DavidStutz – bohr Sep 06 '16 at 06:19
  • From the Caffe documentation: NOTE: each GPU runs the batchsize specified in your train_val.prototxt. So if you go from 1 GPU to 2 GPUs, your effective batchsize will double, e.g. if your train_val.prototxt specified a batchsize of 256 and you run 2 GPUs, your effective batch size is now 512. So you need to adjust the batchsize when running multiple GPUs and/or adjust your solver params, specifically the learning rate. See https://github.com/BVLC/caffe/blob/master/docs/multigpu.md – David Stutz Sep 06 '16 at 09:59
  • @DavidStutz, yah, I missed that part. Thanks a lot for your help. Now it is working well. Cheers!! – bohr Sep 07 '16 at 06:09