15

I am trying to train a very large model. Therefore, I can only fit a very small batch size into GPU memory. Working with small batch sizes results in very noisy gradient estimates.
What can I do to avoid this problem?

Shai

2 Answers

13

You can change iter_size in the solver parameters. Caffe accumulates gradients over iter_size x batch_size instances in each stochastic gradient descent step, so increasing iter_size gives you a more stable gradient estimate when limited memory keeps you from using a large batch_size.
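To see what the accumulation does, here is a minimal sketch in plain numpy (not Caffe's internal code; the data, shapes, and hyperparameter values are made up for illustration). Averaging the gradients of iter_size small mini-batches before a single update is the same as one SGD step on the union of those iter_size x batch_size samples, while only one small mini-batch ever has to sit in GPU memory at a time. In solver.prototxt this amounts to setting, say, iter_size: 4 next to your usual solver options.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 10))                 # toy inputs (assumed shape)
    y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=256)

    w = np.zeros(10)                               # model parameters
    lr, batch_size, iter_size = 0.1, 8, 4          # effective batch = 8 * 4 = 32

    grad_accum = np.zeros_like(w)
    for _ in range(iter_size):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # gradient of the mean squared error on this small mini-batch
        grad_accum += 2.0 / batch_size * Xb.T @ (Xb @ w - yb)

    # one parameter update with the averaged gradient: the same step SGD would
    # take on the iter_size * batch_size samples drawn above
    w -= lr * grad_accum / iter_size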

Liang Xiao
4

As stated in this post, the batch size is not a problem in theory (the convergence of stochastic gradient descent has been proven even with a batch size of 1). Make sure you implement your batches correctly: the samples should be picked at random from your data, as sketched below.
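To make the "picked at random" part concrete, here is a minimal sketch in plain numpy (the function name, toy data, and shapes are made up for illustration): shuffle the indices once per epoch and slice mini-batches from the shuffled order, so every sample is visited in a random order each epoch.

    import numpy as np

    def iterate_minibatches(X, y, batch_size, rng):
        """Yield (X_batch, y_batch) pairs covering the data in random order."""
        idx = rng.permutation(len(X))              # fresh shuffle each epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], y[batch]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                  # toy data (assumed shapes)
    y = rng.normal(size=100)
    for Xb, yb in iterate_minibatches(X, y, batch_size=8, rng=rng):
        pass                                       # one SGD step per mini-batch goes here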

Hatim Khouzaimi
  • Indeed it's a nice theoretical result, but in practice, especially when the net is large and involves many parameters, one might still prefer using a larger batch size. – Shai Apr 10 '16 at 09:09
  • Can you provide a little more detail about your implementation? Number of parameters? The maximum batch size you can use? – Hatim Khouzaimi Apr 10 '16 at 09:15
  • I am trying to learn a recurrent model, so the batch size is a trade-off between the number of time steps I can unroll and the number of independent sequences I can process. The more time steps I include, the fewer sequences I can process, and thus the noise in the gradient estimation rises. – Shai Apr 10 '16 at 09:50
  • You might want to read this: http://research.microsoft.com/pubs/192769/tricks-2012.pdf. – Hatim Khouzaimi Apr 10 '16 at 10:19