
What are the core differences between how GGML, GPTQ and bitsandbytes (NF4) do quantisation?

Which will perform best on:

a) Mac (I'm guessing GGML)

b) Windows

c) T4 GPU

d) A100 GPU

So far, I've run GPTQ and bitsandbytes NF4 on a T4 GPU (loading sketch below) and found:

| Model | Quantisation | PPL | GPU Mem (GB) | Speed (tok/s) |
|---|---|---|---|---|
| fLlama-7B (2GB shards) | bitsandbytes NF4 | 8.8 | 4.7 | 12.2 |
| Llama-7B-GPTQ-4bit-128 | GPTQ 4-bit, group size 128 | 9.3 | 4.8 | 21.4 |
| fLlama-13B (4GB shards) | bitsandbytes NF4 | 8.0 | 8.2 | 7.9 |
| Llama-13B-GPTQ-4bit-128 | GPTQ 4-bit, group size 128 | 7.8 | 8.5 | 15.0 |
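
For context, here's a minimal sketch of how the two GPU variants can be loaded with transformers + bitsandbytes and AutoGPTQ. This isn't my exact benchmark script, and the model repo IDs are placeholders:

```python
# Minimal sketch (not the exact benchmark script); model IDs are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from auto_gptq import AutoGPTQForCausalLM

# bitsandbytes NF4: quantises the fp16 weights on the fly at load time,
# so it works on any HF checkpoint without a separate conversion step.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_nf4 = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",            # placeholder repo
    quantization_config=nf4_config,
    device_map="auto",
)

# GPTQ: loads weights that were already quantised offline against a
# calibration set (4-bit, group size 128, matching the "-4bit-128" naming).
model_gptq = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/LLaMa-7B-GPTQ",         # placeholder repo
    device="cuda:0",
    use_safetensors=True,
)
```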

I've also run GGML on the T4 and only got 2.2 tok/s, so it seems much slower, whether I use 3-bit or 5-bit quantisation.
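
For what it's worth, tok/s can be timed end to end roughly like this (reusing model_nf4 from the sketch above; the prompt and token count are arbitrary, and this isn't necessarily how I produced the numbers in the table):

```python
# Rough end-to-end generation speed; prompt and max_new_tokens are arbitrary.
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # placeholder
inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda:0")

start = time.time()
out = model_nf4.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tok/s")
```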
