I installed StableLM on a GCP VM with these specs:
1 x NVIDIA Tesla P4, 8 vCPUs, 30 GB memory.
I set the model parameter `llm_int8_enable_fp32_cpu_offload=True`, but it takes far too long to answer a question: about 8 minutes. It was faster even on CPU alone, at about 2 minutes. I downloaded the repository directly from the official GitHub link and I'm running the notebook from it. What am I doing wrong? (I installed the NVIDIA driver and CUDA, and the code can find `nvidia-smi`.)
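A minimal sketch of the kind of driver check mentioned above, using only the Python standard library. The helper name `gpu_summary` is hypothetical and not part of the StableLM notebook. It only confirms that `nvidia-smi` is on the PATH and runs; it does not show whether PyTorch is actually placing the model on the GPU.

```python
import shutil
import subprocess


def gpu_summary():
    """Return nvidia-smi's per-GPU summary as text, or None if it is unavailable."""
    # Locate the binary first so a missing driver is reported cleanly.
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "--query-gpu=name,memory.total,memory.used", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # A non-zero exit code usually means the driver is installed but not working.
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    print(gpu_summary() or "nvidia-smi not found or failed")
```

If this prints a line for the Tesla P4 but generation is still slow, the used-memory column can hint at whether most of the model weights ended up on the CPU instead of the GPU.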
Also, when I remove the `llm_int8_enable_fp32_cpu_offload=True` parameter, the code doesn't work at all. It throws the error below (I upgraded the VM to 16 vCPUs and 104 GB of memory, but it still shows this error):