With the release of Llama 2 I wanted to try this myself. To see how fast I can make it, I fired up (from time to time, briefly) a g5dn.metal instance with 96 CPU cores and 8 GPU cards. I used this article (and the facebookresearch README mentioned below) to get started:
https://dev.to/timesurgelabs/how-to-run-llama-2-on-anything-3o5m
I'm using llama.cpp directly, compiled with
LLAMA_CUBLAS=1 make CUDA_DOCKER_ARCH=all
and I can see this working. This is the command I am running right now for testing:
./main -t 4 -ngl 800 -gqa 8 -m /llm2/Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q4_1.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: tell me a story about llamas. \n### Response: "
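To compare settings I don't go by feel alone: llama.cpp prints a timing summary at the end of each run, so I capture it and compare the tokens-per-second figure between flag combinations. A minimal sketch (I believe the llama_print_timings lines go to stderr; the fixed -n 256 and the run.log file name are just my choices so that runs are comparable):
# capture the timing summary to a log, then pull out the tokens/sec lines
./main -t 4 -ngl 800 -gqa 8 -m /llm2/Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q4_1.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n 256 -p "### Instruction: tell me a story about llamas. \n### Response: " 2> run.log
grep llama_print_timings run.log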
I am monitoring resource use while this runs. My goal is to get this very expensive instance ($7.60 per hour) to put all of its resources to work on this task: squeeze as much work out of it as possible and reduce the idle time and idle % of all its subsystems (CPU, GPU, memory, IO) as much as possible.
I am monitoring CPU utilization with top -H and the GPUs with watch -n 1 nvidia-smi. Of course I am also running iostat 1 and vmstat 1, where I can see the immense disk read activity during the (first) start, and I verify via bi/bo that there is no thrashing or ongoing paging going on (i.e., all data fits into RAM).
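Rather than eyeballing nvidia-smi by hand, I also log per-GPU utilization once per second to a file while a run is going. A minimal sketch (the --query-gpu fields and the -l loop flag are standard nvidia-smi options; gpu_util.csv is an arbitrary file name):
# log per-GPU utilization once per second during a generation run
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,utilization.memory --format=csv -l 1 > gpu_util.csv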
I have never been able to get the GPU utilization over 37%, no matter which tuning options I choose, and even that only with the 70B parameter model (GGML, quantized at q4, i.e. 4-bit). The 13B model can't push GPU use over 25%, and the 7B hardly over 20%.
Here is what I found:

-t 4
- the number of CPU threads used. I find that a high value (like 96, one per CPU I have available) actually makes it (much) slower. I can't tell whether -t 1 always produces the fastest result or whether some modest number helps, but a high number definitely makes things worse. Leaving this argument unspecified means it will use all available CPUs, which is also slow.

-gqa 8
- this is required for the 70B parameter GGML file from TheBloke; it has to do with the way the binary file is laid out.

-ngl 800
- when I do not specify -ngl at all, the 8 GPU cards are apparently initialized but never actually run; they stay at 0% utilization for the entire run, which is very slow. Going from 1 to 8 to 80 seems to make things faster, and 800 faster still, but while I can apparently set the number without limit, it does not get any faster beyond that point, and possibly, who knows, even gets slower (see the sketch after this list).

-ts 3,1
- when I first tried -ngl with a number greater than 1 I noticed that no GPU was being used in OpenCL mode, but anyway, I think in CUDA mode it's not necessary.
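For concreteness, here is the kind of invocation I am converging on. This is a sketch only: the -ngl 83 value assumes the 70B model has roughly 80 layers, so anything at or above that should already offload everything (which would also explain why 800 is no faster than 80-ish), and the even -ts split across the 8 cards is just a guess I have not verified:
# single CPU thread, full offload, explicit even tensor split across the 8 GPUs
./main -t 1 -gqa 8 -ngl 83 -ts 1,1,1,1,1,1,1,1 -m /llm2/Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q4_1.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: tell me a story about llamas. \n### Response: "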
I have followed the facebookresearch/llama2 README and set up miniconda and so on, including the
conda activate gpu
step that switches me from the (base) environment into the (gpu) environment, but I don't need any of that any more now that I just use the llama.cpp main program compiled as described above.
My question is: why can't I find any combination of the -t and -ngl parameters that gets the GPU utilization higher than 37%? Token generation is also limited to about 8 tokens per second. I am paying so much for this AWS metal instance that I need to be sure I am maxing out all its resources. And what am I supposed to do with the 90 unused CPUs on that box (mine bitcoins to pay for the usage hours? -- hey, it's a joke, but it pains me to have so many CPUs sitting idle when deploying them as worker threads only slows down my primary use case).
I wonder whether the GPU usage can't go higher because the VRAM isn't fully used either, or rather, because only the bigger (70B) model gets GPU use over 30% at all. Does that mean that with a differently quantized model and a different memory layout (that -gqa parameter?) I could perhaps push GPU utilization higher?
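For what it's worth, this is how I check whether the VRAM is actually filling up across the cards (standard nvidia-smi query fields):
# per-GPU memory occupancy during a run
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv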