
I have three questions:

Given the number of LLM parameters in billions, how can you figure out how much GPU RAM you need to run the model?

If you have enough CPU RAM (i.e. no GPU), can you run the model, even if it is slow?

Can you run LLM models (like h2ogpt, open-assistant) in mixed GPU RAM and CPU RAM?

– sten

2 Answers


How much VRAM?

Inference often runs in float16, meaning 2 bytes per parameter. For a 7B-parameter model, you therefore need about 14 GB of RAM to run it in float16 precision. Training/finetuning is usually done in float16 or float32, but inference usually works well in float16 right away. In some cases, models can be quantized and run efficiently in int8 or smaller.
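As a rule of thumb, that is just parameter count times bytes per parameter. A minimal back-of-the-envelope sketch (the helper name is made up, not from any library):

# Weights only: ignores activations, KV cache, and framework overhead,
# which add a few more GB in practice.
def weight_memory_gb(params_in_billions, bytes_per_param=2):
    # 2 bytes/param for float16, 4 for float32, 1 for int8, 0.5 for 4-bit
    return params_in_billions * bytes_per_param

print(weight_memory_gb(7))      # ~14 GB in float16
print(weight_memory_gb(7, 1))   # ~7 GB in int8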

Can you run the model on CPU, assuming enough RAM?

Usually yes, but it depends on the model and the library; some layers may not be implemented for CPU.
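For example, with the Hugging Face transformers library, simply loading a model without a device map keeps it in ordinary RAM and runs it on CPU. A minimal sketch (float32 is chosen because some operations lack float16 CPU kernels; bfloat16 is another common choice):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenAssistant/stablelm-7b-sft-v7-epoch-3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# No device_map: weights are loaded into CPU RAM and inference runs on CPU.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)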

Can you run in mixed CPU/GPU mode?

Many libraries now support running some of the layers on CPU and others on GPU. For example, the Hugging Face transformers library supports automatically mapping layers to all your devices, meaning it will try to fill your GPUs to the maximum and offload the rest to CPU RAM. To do this, set device_map="auto" when loading the model:

from transformers import AutoModelForCausalLM, AutoTokenizer   
tokenizer = AutoTokenizer.from_pretrained("OpenAssistant/stablelm-7b-sft-v7-epoch-3")
model = AutoModelForCausalLM.from_pretrained("OpenAssistant/stablelm-7b-sft-v7-epoch-3",
                                             device_map="auto")
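Once loaded this way, generation works the same as on a single device. A minimal usage sketch (the prompt is arbitrary, and this particular model may expect a specific chat format for best results):

inputs = tokenizer("What is the tallest mountain on Earth?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))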
– log0

How do you calculate the amount of RAM needed? I'm assuming that you mean just inference, no training.

The paper "Reducing Activation Recomputation in Large Transformer Models" has good information on calculating the size of a Transformer layer.

b: batch size
s: sequence length
l: layers
a: attention heads
h: hidden dimension
p: bytes of precision
activations per layer = s*b*h*(34 + (5*a*s)/h)

The paper calculates this at 16-bit precision, so the formula above is in bytes (2 bytes per value). Dividing by 2 turns it into a count of values, which we can then multiply by however many bytes of precision we actually use. Summing over all l layers:

activations = l * (17*b*h*s + (5/2)*a*b*s^2)   #divided by 2, simplified, summed over layers

total bytes = p * (params + activations)

Let's look at Llama 2 7B as an example:

params = 7*10^9

p = 4    #bytes of precision (float32)
b = 1    #batchsize 
s = 2048 #sequence length
l = 32   #layers
a = 32   #attention heads
h = 4096 #hidden dimension

activations => 15,300,820,992
p * (activations + params) => about 89 GB (~83 GiB)

Note that you can drastically reduce the memory needed through quantization. With 4-bit quantization (p = 0.5 bytes) this comes down to roughly 11 GB.
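For completeness, the same arithmetic as a small script (a sketch; the variable names are ad hoc):

# Memory estimate for Llama 2 7B using the formula above.
b, s, l, a, h = 1, 2048, 32, 32, 4096   # batch, seq length, layers, heads, hidden dim
params = 7e9

# Activation count: per-layer formula divided by 2 and summed over l layers.
activations = l * (17 * b * h * s + (5 / 2) * a * b * s**2)

for label, p in [("float32", 4), ("float16", 2), ("4-bit", 0.5)]:
    total_bytes = p * (params + activations)
    print(f"{label}: {total_bytes / 1e9:.0f} GB ({total_bytes / 2**30:.0f} GiB)")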

I hope that helps and that I didn't miss anything important.

– Brian
  • Nice source, but your formula would be clearer if specified that the output of `p * (activations + params)` is in bits, and needs to be converted to bytes (ie divided by 8) to get the GB measurement you report. Also, the paper you are referring to is specific to training, where the activations are needed to compute gradients. At inference, there would be no need to keep past activations in the forward pass. Can you comment on it? – joeDiHare Sep 01 '23 at 19:17