Sometimes I run into a problem:

OOM when allocating tensor with shape

e.g.

OOM when allocating tensor with shape (1024, 100, 160)

Here 1024 is my batch size, but I don't know what the other dimensions are. If I reduce the batch size or the number of neurons in the model, it runs fine.

Is there a generic way to calculate optimal batch size based on model and GPU memory, so the program doesn't crash?

In short: I want the largest batch size possible in terms of my model, which will fit into my GPU memory and won't crash the program.

desertnaut
Andrzej Gis
  • Honestly, from what you've posted just try with 512. If that doesn't work, then halve it again. You're limited to powers of 2, so keep reducing till it works. It isn't so much 'optimal' batch size as it is 'what fits in memory'. – MikeB2019x Mar 23 '23 at 13:24

5 Answers

From the recent Deep Learning book by Goodfellow et al., chapter 8:

Minibatch sizes are generally driven by the following factors:

  • Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
  • Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
  • If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
  • Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
  • Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability because of the high variance in the estimate of the gradient. The total runtime can be very high as a result of the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.

Which in practice usually means "in powers of 2 and the larger the better, provided that the batch fits into your (GPU) memory".
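
In code, the usual practice is simple trial and error: start from a large power of 2 and halve on OOM. Below is a minimal sketch of that loop, assuming a Keras/TensorFlow setup; build_model, x_train and y_train are placeholders for your own model factory and data, not anything from the answer:

import tensorflow as tf

def find_largest_batch_size(build_model, x_train, y_train, start=1024):
    batch_size = start
    while batch_size >= 1:
        try:
            model = build_model()                 # fresh model for every attempt
            model.fit(x_train, y_train, batch_size=batch_size, epochs=1, verbose=0)
            return batch_size                     # this size fits in GPU memory
        except tf.errors.ResourceExhaustedError:  # the OOM error from the question
            tf.keras.backend.clear_session()      # release the memory just exhausted
            batch_size //= 2                      # try the next power of 2 down
    raise RuntimeError("Even batch_size=1 does not fit in GPU memory")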

You might also want to consult several good posts here on Stack Exchange:

Just keep in mind that the paper by Keskar et al. 'On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima', quoted by several of the posts above, has received some objections by other respectable researchers of the deep learning community.

UPDATE (Dec 2017):

There is a new paper by Yoshua Bengio & team, Three Factors Influencing Minima in SGD (Nov 2017); it is worth reading in the sense that it reports new theoretical & experimental results on the interplay between learning rate and batch size.

UPDATE (Mar 2021):

Of interest here is also another paper from 2018, Revisiting Small Batch Training for Deep Neural Networks (h/t to Nicolas Gervais), which runs contrary to the "larger the better" advice; quoting from the abstract:

The best performance has been consistently obtained for mini-batch sizes between m=2 and m=32, which contrasts with recent work advocating the use of mini-batch sizes in the thousands.

desertnaut
  • It does not really answer my question. I want the largest batch size possible in terms of my model, which will fit into my GPU memory. – Andrzej Gis Oct 09 '17 at 23:00
  • Understood. In practice, especially if you use a GPU, the powers of 2 requirement is so limiting that, even if you get an 'optimal' size of, say, 800, you never use it; what you do is start with an n (power of 2) and, if you get an OOM, try with n/2, then with n/4 etc (if not, you try 2*n) - see 4th bullet above – desertnaut Oct 10 '17 at 09:05
  • Going down with the size if an error occurs is a big nuisance when you're experimenting with hyperparameters and topologies. A generic formula would be great, even if the result would be rounded to a power of 2. – Andrzej Gis Oct 10 '17 at 09:14
  • I don't see how your excerpts led you to the conclusion that larger is better. Maybe you could pinpoint the exact source that made you conclude this? – Nicolas Gervais Apr 19 '20 at 13:25
  • @NicolasGervais what about the very first bullet, "*Larger batches provide a more accurate estimate of the gradient*"?? – desertnaut Apr 19 '20 at 13:27
  • That might not be as meaningful as you seem to think. Especially in light of [evidence](https://arxiv.org/abs/1804.07612) that is more recent than any of your sources, which _strongly_ argues against batch size over 32. – Nicolas Gervais Apr 19 '20 at 13:33
  • @NicolasGervais That's another matter (answer hasn't been updated since 2017), and not what you asked in the first place. Based on what has been quoted here, I cannot see any inconsistency, as you seem to imply. – desertnaut Apr 19 '20 at 13:36
  • @NicolasGervais that paper on small batch sizes has a lot of weaknesses. Besides the fact that it is not published in any peer reviewed venue, it does not cover much recent work on learning rate schedules. In particular it does not reference any of the work by Leslie N. Smith on one-shot training schedules with very high learning rates, the [Super-Convergence paper](https://arxiv.org/abs/1708.07120) in particular. Tuning the learning rate is essential to training performance, but the authors have punted in favor of a naive linear scaling as batch size increases. – mcskinner Apr 19 '20 at 19:43
  • Don't get me wrong, it's an interesting theoretical tack to take. But it seems like a very narrow view to take in practice. – mcskinner Apr 19 '20 at 19:44
  • On a practical side, I'm [re]training a shallow dnn on a machine with a single GPU. If the batch size is 2048, it takes ~20 min per epoch (~12 epochs to converge). If I set the batch size to 32, the estimated time to converge is 188 hours. On a CPU it's similarly unrealistic time wise. – rodrigo-silveira Jun 23 '21 at 11:27

You can estimate the largest batch size using:

Max batch size = available GPU memory in bytes / 4 / (size of the tensors + number of trainable parameters)

where the factor of 4 is the number of bytes per 32-bit float.
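
A minimal sketch of plugging numbers into this formula; the function name and the example figures are illustrative, not taken from the answer:

def estimate_max_batch_size(gpu_memory_bytes, per_sample_tensor_elements, trainable_params):
    # 4 bytes per float32 value; per_sample_tensor_elements is the total number of
    # activation elements produced for ONE sample across all layers
    return gpu_memory_bytes // (4 * (per_sample_tensor_elements + trainable_params))

# Hypothetical example: a 6 GB card, ~1.5M activation elements per sample, 25M parameters
print(estimate_max_batch_size(6 * 1024**3, 1500000, 25000000))   # 60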

ilan
  • How do I get the *size of tensors* and the number of *trainable parameters*? Aren't you missing the model size in the equation? – Andrzej Gis Oct 10 '17 at 00:11
  • @gisek the model size is actually the no of training parameters, which in Keras you get with `model.summary()` – desertnaut Oct 10 '17 at 08:53
  • @desertnaut I'm not sure if you're right. If I create a large network and feed it with batch_size=1, I also get the same error. – Andrzej Gis Oct 10 '17 at 09:10
  • Of course - it can certainly happen that the combination of your model size (trainable parameters) and input data size exhaust your memory even with batch_size = 1, especially if you have a small GPU... – desertnaut Oct 10 '17 at 09:29
  • @desertnaut hehe, I didn't get that "no" stands for "number". Now it makes sense :) – Andrzej Gis Oct 13 '17 at 17:56
  • What is _size of tensors_? I am still confused about that part. – Melike Jul 04 '18 at 15:59
  • @Melike Each layer has its tensor + one or more weight matrices (usually referred to as trainable parameters). For example: if you're feeding your network with 200x200 RGB images, then the size of your input tensor (in bytes) is [batch size] * 3 * 200 * 200 * 4 (4 bytes per value for 32-bit floats) – ilan Jul 05 '18 at 11:43
  • @ilan Theoretically your formula makes sense. Have you ever tested it empirically? I am observing the following: For Alexnet with 62 million parameters and a image size of 224x224x3 and a 6GB graphics card, I should be able to fit: (6 GB - (62 Million * 4 bytes)) / (224 * 224 * 3 * 4 bytes) = *9553* as max_batch_size. In practice I am not able to run training with more than batch_size = 512. With 1024 it already crashes. Second example: Resnet-50 has only 25 Million parameters. So I should get an even higher max_batch_size. In practice training crashes with batch_size=128. Please advise. – Alex Aug 07 '18 at 14:59
  • @Alex You should take into account all the tensors, not just the input – ilan Aug 20 '18 at 00:50
  • @ilan Could you please give an example what tensors you mean? I thought with all the trainable parameters I do take that into consideration? Please correct me if I am wrong. – Alex Aug 27 '18 at 07:25
  • @Alex For each layer your model has to store an input placeholder, one or more weight matrices (trainable or otherwise) and an output placeholder (which may also be the next layer's input). – ilan Aug 30 '18 at 10:43
  • Is it possible to include reference from which paper this was used? – Pam Cesar Aug 24 '22 at 14:22

Use the summaries provided by torchsummary (`pip install torchsummary`) or Keras (the built-in `model.summary()`).

E.g.

from torchsummary import summary
summary(model)
.....
.....
================================================================
Total params: 1,127,495
Trainable params: 1,127,495
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.02
Forward/backward pass size (MB): 13.93
Params size (MB): 4.30
Estimated Total Size (MB): 18.25
----------------------------------------------------------------

Each instance you put in the batch requires its own forward/backward pass worth of memory, while the model parameters only need to be stored once. People seem to prefer batch sizes that are powers of two, probably because of automatic layout optimization on the GPU.

Don't forget to linearly increase your learning rate when increasing the batch size.
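
As a sketch, the linear scaling rule looks like this (the baseline values are purely illustrative):

base_lr, base_batch_size = 0.1, 256             # a baseline known to train well
batch_size = 1024                               # the larger batch you want to use
lr = base_lr * (batch_size / base_batch_size)   # scale the learning rate by the same factor -> 0.4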

Let's assume we have a Tesla P100 at hand with 16 GB memory.

(GPU memory in MB - params size in MB) / (forward/backward pass size in MB per sample)
(16000 - 4.3) / 13.93 = 1148.29
Rounded down to a power of 2, this gives a batch size of 1024.
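
The same arithmetic as a small helper; this is only a sketch, and it assumes (as the answer does) that the forward/backward pass size reported by the summary is per sample:

import math

def batch_size_from_summary(gpu_memory_mb, params_mb, fwd_bwd_mb_per_sample):
    # memory left after the parameters, divided by the per-sample forward/backward size
    raw = (gpu_memory_mb - params_mb) / fwd_bwd_mb_per_sample
    return 2 ** int(math.log2(raw))   # round down to a power of 2

# numbers taken from the torchsummary output above, on a 16 GB Tesla P100
print(batch_size_from_summary(16000, 4.3, 13.93))   # 1024
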
Ario
0-_-0

Here is a function to find a batch size for training the model:

def FindBatchSize(model):
    """model: model architecture that is yet to be trained"""
    import os, gc, psutil
    from keras import backend as K
    BatchFound = 16

    try:
        total_params = int(model.count_params())
        GCPU = "CPU"
        # find whether a GPU is available
        try:
            if K.tensorflow_backend._get_available_gpus() == []:
                GCPU = "CPU"
            else:
                GCPU = "GPU"
        except Exception:
            from tensorflow.python.client import device_lib
            def get_available_gpus():
                local_device_protos = device_lib.list_local_devices()
                return [x.name for x in local_device_protos if x.device_type == 'GPU']
            GCPU = "GPU" if get_available_gpus() else "CPU"

        # decide the batch size from GPU availability, CPU count and model complexity
        if GCPU == "GPU" and os.cpu_count() > 15 and total_params < 1000000:
            BatchFound = 64
        if os.cpu_count() < 16 and total_params < 500000:
            BatchFound = 64
        if GCPU == "GPU" and os.cpu_count() > 15 and 1000000 <= total_params < 2000000:
            BatchFound = 32
        if GCPU == "GPU" and os.cpu_count() > 15 and 2000000 <= total_params < 10000000:
            BatchFound = 16
        if GCPU == "GPU" and os.cpu_count() > 15 and total_params >= 10000000:
            BatchFound = 8
        if os.cpu_count() < 16 and total_params > 5000000:
            BatchFound = 8
        if total_params > 100000000:
            BatchFound = 1
    except Exception:
        pass

    try:
        # shrink the batch size further if system memory is already under pressure
        memory_used = psutil.virtual_memory().percent
        if memory_used > 75.0:
            BatchFound = 8
        if memory_used > 85.0:
            BatchFound = 4
        if memory_used > 90.0:
            BatchFound = 2
        if total_params > 100000000:
            BatchFound = 1
        print("Batch Size:  " + str(BatchFound))
        gc.collect()
    except Exception:
        pass

    return BatchFound
desertnaut
Anurag Gupta
  • Can you please explain the code and why the if conditions point to a specific batch size? Does your code deal with the memory size of each sample? – ias Sep 05 '19 at 18:27

I ran into a similar GPU memory error, which was solved by configuring the TensorFlow session with the following:

# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory as needed instead of grabbing it all upfront
session = tf.Session(config=config)

see: google colaboratory `ResourceExhaustedError` with GPU
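
Note that tf.ConfigProto and tf.Session are TensorFlow 1.x APIs. If you are on TensorFlow 2.x, the equivalent memory-growth setting is sketched below (run it before any GPU operation):

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving the whole card at startup
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)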

michael