I am trying to train a very large model, so I can only fit a very small batch size into GPU memory. Working with small batch sizes results in very noisy gradient estimates.
What can I do to avoid this problem?
- related: http://stats.stackexchange.com/q/201775/66467 – Shai May 11 '16 at 05:57
2 Answers
You can change the iter_size parameter in the solver. Caffe accumulates gradients over iter_size × batch_size instances in each stochastic gradient descent step, so increasing iter_size gives you a more stable gradient when limited memory prevents you from using a large batch_size.
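The effect of iter_size can be illustrated outside Caffe. The sketch below, using an illustrative linear least-squares model, shows that averaging the gradients of several small mini-batches is mathematically the same as taking the gradient over one large effective batch of iter_size × batch_size samples:

```python
import numpy as np

# Framework-agnostic sketch of what Caffe's iter_size does:
# accumulate gradients over several small mini-batches before one
# SGD update, so the effective batch is iter_size * batch_size.
# The model and loss (linear least squares) are illustrative only.

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))          # 32 samples, 4 features
y = rng.normal(size=32)
w = np.zeros(4)

def grad(w, Xb, yb):
    # gradient of mean squared error over the mini-batch
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

iter_size, batch_size = 4, 8          # effective batch = 32

# accumulate over iter_size mini-batches, then average
acc = np.zeros_like(w)
for i in range(iter_size):
    sl = slice(i * batch_size, (i + 1) * batch_size)
    acc += grad(w, X[sl], y[sl])
acc /= iter_size

# identical to one gradient over the full effective batch
full = grad(w, X, y)
assert np.allclose(acc, full)
```

Only one mini-batch needs to fit in GPU memory at a time, which is why this works around the memory limit.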

Liang Xiao
As stated in this post, the batch size is not a problem in theory (convergence of stochastic gradient descent has been proven even with a batch size of 1). Make sure you implement your batching correctly (the samples should be picked at random from your data).
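"Picked at random" can be implemented in several ways; one common sketch (names and sizes are illustrative) is to reshuffle the data each epoch so every mini-batch is a random sample without replacement:

```python
import numpy as np

# Draw randomized mini-batches by permuting sample indices once per
# epoch; each sample then appears in exactly one mini-batch per epoch.
def minibatches(n_samples, batch_size, rng):
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(minibatches(10, 4, rng))
# every sample index 0..9 appears exactly once across the epoch
assert sorted(np.concatenate(batches).tolist()) == list(range(10))
```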


Hatim Khouzaimi
- Indeed, it's a nice theoretical result, but in practice, especially when the net is large and involves many parameters, one might still prefer a larger batch size. – Shai Apr 10 '16 at 09:09
- Can you provide a little more detail about your implementation? Number of parameters? The maximum batch size you can use? – Hatim Khouzaimi Apr 10 '16 at 09:15
- I am trying to learn a recurrent model, so the batch size is a trade-off between the number of time steps I can unroll and the number of independent sequences I can process. The more time steps I include, the fewer sequences I can process, and the noisier the gradient estimate becomes. – Shai Apr 10 '16 at 09:50
- You might want to read this: http://research.microsoft.com/pubs/192769/tricks-2012.pdf – Hatim Khouzaimi Apr 10 '16 at 10:19