I have been playing around with oobabooga text-generation-webui on Ubuntu 20.04 with an NVIDIA GTX 1060 6GB for a few weeks without problems. I have been using llama2-chat models, sharing memory between system RAM and NVIDIA VRAM, and I installed it without much trouble by following the instructions in its repository.
So the natural next step is to use the llama.cpp model loader through its llama-cpp-python bindings and play around with it myself. Using the same miniconda3 environment that oobabooga text-generation-webui uses, I started a Jupyter notebook, and I can run inference and everything works well, BUT ONLY on the CPU. Like below:
from llama_cpp import Llama

# n_gpu_layers=32 should offload 32 layers to the GPU
llm = Llama(model_path="/mnt/LxData/llama.cpp/models/meta-llama2/llama-2-7b-chat/ggml-model-q4_0.bin",
            n_gpu_layers=32, n_threads=6, n_ctx=3584, n_batch=521, verbose=True)
prompt = """[INST] <<SYS>>
Name the planets in the solar system?
<</SYS>>
[/INST]
"""
output = llm(prompt, max_tokens=350, echo=True)
print(output['choices'][0]['text'].split('[/INST]')[-1])
Output:

Of course! Here are the eight planets in our solar system, listed in order from closest to farthest from the Sun:
- Mercury
- Venus
- Earth
- Mars
- Jupiter
- Saturn
- Uranus
- Neptune
Note that Pluto was previously considered a planet but is now classified as a dwarf planet due to its small size and unique orbit.
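To make the question concrete: my understanding is that llama.cpp reports its compile-time features in a system-info string, and a build with cuBLAS enabled should show BLAS = 1 there. A minimal sketch of how one could check this (assuming llama-cpp-python exposes the low-level llama_print_system_info binding from llama.h, and that it returns bytes):

import llama_cpp

# Assumption: this low-level binding mirrors llama.h and returns bytes.
# A CPU-only build should report "BLAS = 0"; a cuBLAS build "BLAS = 1".
info = llama_cpp.llama_print_system_info()
print(info.decode("utf-8") if isinstance(info, (bytes, bytearray)) else info)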
What is wrong? Why can't I offload layers to the GPU, as the n_gpu_layers=32 parameter specifies, just like oobabooga text-generation-webui already does in the same miniconda environment without any problems?
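In case it matters: I suspect the prebuilt pip wheel of llama-cpp-python is compiled without CUDA support, so n_gpu_layers would be silently ignored. If that is the cause, would reinstalling with something like CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force-reinstall --no-cache-dir llama-cpp-python (as I believe the llama-cpp-python README suggests) be the right fix, or is something else going on?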