With the release of Llama 2 I wanted to try this myself. To see how fast I can make it, I fired up (from time to time, briefly) a g5dn.metal instance with 96 CPU cores and 8 GPU cards. I used this article (and the facebookresearch README mentioned below) to get started:
https://dev.to/timesurgelabs/how-to-run-llama-2-on-anything-3o5m
I'm using llama.cpp directly, compiled with
LLAMA_CUBLAS=1 make CUDA_DOCKER_ARCH=all
and I can see this working. This is the command I am running right now for testing:
./main -t 4 -ngl 800 -gqa 8 -m /llm2/Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q4_1.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: tell me a story about llamas. \n### Response: "
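To compare settings I don't go by feel alone: llama.cpp prints a timing summary at the end of each run, so I capture it and compare the tokens-per-second figure between flag combinations. A minimal sketch (I believe the llama_print_timings lines go to stderr; the fixed -n 256 and the run.log file name are just my choices so that runs are comparable):
# capture the timing summary to a log, then pull out the tokens/sec lines
./main -t 4 -ngl 800 -gqa 8 -m /llm2/Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q4_1.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n 256 -p "### Instruction: tell me a story about llamas. \n### Response: " 2> run.log
grep llama_print_timings run.log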
I am monitoring resource use while this runs. My goal is to get this very expensive instance ($7.60 per hour) to put all of its resources to work on this task: squeeze as much work out of it as possible and reduce the idle time and idle % of all its subsystems (CPU, GPU, memory, IO) as much as possible.
I am monitoring CPU utilization with top -H and the GPUs with watch -n 1 nvidia-smi. Of course I am also running iostat 1 and vmstat 1, where I can see the immense disk read activity during the (first) start, and I verify via bi/bo that there is no thrashing or ongoing paging going on (i.e., all data fits into RAM).
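Rather than eyeballing nvidia-smi by hand, I also log per-GPU utilization once per second to a file while a run is going. A minimal sketch (the --query-gpu fields and the -l loop flag are standard nvidia-smi options; gpu_util.csv is an arbitrary file name):
# log per-GPU utilization once per second during a generation run
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,utilization.memory --format=csv -l 1 > gpu_util.csv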
I have never been able to get the GPU utilization over 37%, no matter which tuning options I choose, and even that only with the 70B parameter model (GGML, quantized at q4, i.e. 4-bit). The 13B model can't push GPU use over 25%, and the 7B hardly over 20%.
Here is what I found:

-t 4
- the number of CPU threads used. I find that a high value (like 96, one per CPU I have available) actually makes it (much) slower. I can't tell whether -t 1 always produces the fastest result or whether some modest number helps, but a high number definitely makes things worse. Leaving this argument unspecified means it will use all available CPUs, which is also slow.

-gqa 8
- this is required for the 70B parameter GGML file from TheBloke; it has to do with the way the binary file is laid out.

-ngl 800
- when I do not specify -ngl at all, the 8 GPU cards are apparently initialized but never actually run; they stay at 0% utilization for the entire run, which is very slow. Going from 1 to 8 to 80 seems to make things faster, and 800 faster still, but while I can apparently set the number without limit, it does not get any faster beyond that point, and possibly, who knows, even gets slower (see the sketch after this list).

-ts 3,1
- when I first tried -ngl with a number greater than 1 I noticed that no GPU was being used in OpenCL mode, but anyway, I think in CUDA mode it's not necessary.
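For concreteness, here is the kind of invocation I am converging on. This is a sketch only: the -ngl 83 value assumes the 70B model has roughly 80 layers, so anything at or above that should already offload everything (which would also explain why 800 is no faster than 80-ish), and the even -ts split across the 8 cards is just a guess I have not verified:
# single CPU thread, full offload, explicit even tensor split across the 8 GPUs
./main -t 1 -gqa 8 -ngl 83 -ts 1,1,1,1,1,1,1,1 -m /llm2/Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q4_1.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: tell me a story about llamas. \n### Response: "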
I have followed the facebookresearch/llama2 README and set up miniconda and so on, including the
conda activate gpu
step that switches me from the (base) environment into the (gpu) environment, but I don't need any of that any more now that I just use the llama.cpp main program compiled as described above.
My question is: why can't I find any combination of the -t and -ngl parameters that gets the GPU utilization higher than 37%? Token generation is also limited to about 8 tokens per second. I am paying so much for this AWS metal instance that I need to be sure I am maxing out all its resources. And what am I supposed to do with the 90 unused CPUs on that box (mine bitcoins to pay for the usage hours? -- hey, it's a joke, but it pains me to have so many CPUs sitting idle when deploying them as worker threads only slows down my primary use case).
I wonder whether the GPU usage can't go higher because the VRAM isn't fully used either, or rather, because only the bigger (70B) model gets GPU use over 30% at all. Does that mean that with a differently quantized model and a different memory layout (that -gqa parameter?) I could perhaps push GPU utilization higher?
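For what it's worth, this is how I check whether the VRAM is actually filling up across the cards (standard nvidia-smi query fields):
# per-GPU memory occupancy during a run
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv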