What are the core differences between how GGML, GPTQ, and bitsandbytes (NF4) do quantisation?
Which will perform best on:
a) Mac (I'm guessing GGML)
b) Windows
c) T4 GPU
d) A100 GPU
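For context, here's a minimal sketch of how the two GPU-side methods are typically loaded (model ids are placeholders; exact arguments may vary by library version). The comments note one practical difference between them:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# bitsandbytes NF4: the fp16 checkpoint is quantised on the fly at load time.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
nf4_model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",            # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# GPTQ: loads a checkpoint that was already quantised offline (calibration
# data is needed at quantisation time, not at load time).
from auto_gptq import AutoGPTQForCausalLM

gptq_model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/LLaMa-7B-GPTQ",         # placeholder model id
    device="cuda:0",
    use_safetensors=True,
)

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
```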
So far, I've run GPTQ and bitsandbytes NF4 on a T4 GPU and found:
| Model | Quantisation | PPL | GPU mem | Speed (tokens/s) |
|---|---|---:|---:|---:|
| fLlama-7B (2 GB shards) | bitsandbytes NF4 | 8.8 | 4.7 GB | 12.2 |
| Llama-7B-GPTQ-4bit-128 | GPTQ 4-bit, group size 128 | 9.3 | 4.8 GB | 21.4 |
| fLlama-13B (4 GB shards) | bitsandbytes NF4 | 8.0 | 8.2 GB | 7.9 |
| Llama-13B-GPTQ-4bit-128 | GPTQ 4-bit, group size 128 | 7.8 | 8.5 GB | 15.0 |
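For anyone trying to reproduce these: a minimal sketch of one way to measure PPL, peak GPU memory, and generation speed (settings are illustrative, not necessarily the exact ones behind the table above; the PPL loop follows the standard Hugging Face sliding-window recipe):

```python
import math
import time
import torch

@torch.no_grad()
def sliding_window_ppl(model, tokenizer, text, max_len=2048, stride=512):
    # Score each window's new tokens only; context tokens are masked to -100.
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    nlls, prev_end = [], 0
    for begin in range(0, ids.size(1), stride):
        end = min(begin + max_len, ids.size(1))
        target = ids[:, begin:end].clone()
        target[:, : -(end - prev_end)] = -100  # mask already-scored context
        nlls.append(model(ids[:, begin:end], labels=target).loss)
        prev_end = end
        if end == ids.size(1):
            break
    return math.exp(torch.stack(nlls).mean().item())

@torch.no_grad()
def tokens_per_second(model, tokenizer, prompt, new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    t0 = time.time()
    out = model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    generated = out.shape[1] - inputs.input_ids.shape[1]
    return generated / (time.time() - t0)

# Peak GPU memory after loading plus a forward/generate pass:
# torch.cuda.max_memory_allocated() / 1e9  # GB
```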
I've also run GGML on the T4 and only got 2.2 tokens/s, so it seems much slower, whether I use 3-bit or 5-bit quantisation.
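For completeness, a minimal GGML sketch using llama-cpp-python (the path and settings are placeholders). Note that GGML runs on the CPU unless layers are offloaded via n_gpu_layers (requires a cuBLAS build), which might be part of why the T4 number looks so low:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-7b.ggmlv3.q5_1.bin",  # placeholder GGML file
    n_ctx=2048,
    n_gpu_layers=32,  # 0 = pure CPU; >0 offloads layers to the GPU
)
out = llm("The capital of France is", max_tokens=32, temperature=0.0)
print(out["choices"][0]["text"])
```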