Questions tagged [llama-cpp-python]

7 questions
1 vote · 1 answer

Very slow response from LLM-based Q/A query engine

I built a Q/A bot over a 4 MB CSV file stored locally. I'm using Chroma to build the vector DB, with Instructor Large from Hugging Face as the embedding model and LlamaCPP (llama2-13b-chat) as the chat LLM. The vector database created…
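A minimal sketch of the kind of pipeline this question describes, assuming classic LangChain imports; the CSV filename, model path, and query are placeholders. Slow responses with this stack often come down to running a 13B model entirely on CPU, so the llama.cpp GPU offload settings are worth checking.

    # Hypothetical RAG pipeline: CSV -> Chroma (Instructor Large) -> LlamaCpp.
    from langchain.document_loaders import CSVLoader
    from langchain.embeddings import HuggingFaceInstructEmbeddings
    from langchain.vectorstores import Chroma
    from langchain.llms import LlamaCpp
    from langchain.chains import RetrievalQA

    docs = CSVLoader(file_path="data.csv").load()  # placeholder CSV
    embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large")
    vectordb = Chroma.from_documents(docs, embeddings, persist_directory="./chroma_db")

    llm = LlamaCpp(model_path="llama-2-13b-chat.Q4_K_M.gguf", n_ctx=2048)  # placeholder path
    qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectordb.as_retriever())
    print(qa.run("What does the file say about pricing?"))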
0 votes · 0 answers

Llama 2 output format is not parsed correctly

After multiple attempts I haven't been able to find a solution. I'm using Llama 2 together with LangChain for the first time, and the challenge I'm facing is extracting the response from Llama in…
Udemytur · 79 · 1 · 5
0 votes · 2 answers

AssertionError when using llama-cpp-python in Google Colab

I'm trying to use llama-cpp-python (a Python wrapper around llama.cpp) to run inference with the Llama LLM in Google Colab. My code looks like this:

    !pip install llama-cpp-python
    from llama_cpp import ChatCompletionMessage, Llama
    model = Llama( …
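A frequent cause of this AssertionError is a model file whose format does not match the installed wrapper (llama-cpp-python 0.1.79 and later load GGUF files, not older GGML ones). A minimal sketch under that assumption; the Colab model path is a placeholder.

    from llama_cpp import Llama

    # Use a GGUF file with recent llama-cpp-python builds; an old GGML file
    # can fail the load-time assertion.
    model = Llama(model_path="/content/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

    response = model.create_chat_completion(
        messages=[{"role": "user", "content": "Hello, who are you?"}],
    )
    print(response["choices"][0]["message"]["content"])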
0 votes · 0 answers

Running inference on LLaMA-2 13B

I have been running inference on LLaMA-2 13B and suddenly the following error arose:

    AssertionError Traceback (most recent call…
Malik · 29 · 3
0 votes · 0 answers

llama-cpp-python on MacBook (M1) gives an unexpected series of '\x1c' characters as output

To use the GPU on a MacBook (M1 chip), install llama-cpp-python with:

    CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python

Then download the model file from https://huggingface.co/TheBloke/Trurl-2-7B-GGML/tree/main. The model name is…
geralt · 1 · 3
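A hedged sketch of the Metal setup this question describes; the model filename is a placeholder. Garbled output such as repeated '\x1c' bytes is often a model-format mismatch (recent llama-cpp-python expects GGUF, while the linked repo hosts GGML files), so that is worth ruling out first.

    # Shell, run once: build with the Metal backend enabled.
    #   CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
    from llama_cpp import Llama

    llm = Llama(
        model_path="trurl-2-7b.Q4_K_M.gguf",  # placeholder filename
        n_gpu_layers=1,  # any value > 0 enables Metal offloading
        n_ctx=2048,
    )
    print(llm("Hello, Trurl!", max_tokens=64)["choices"][0]["text"])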
0 votes · 1 answer

llama-cpp-python not using NVIDIA GPU CUDA

I have been playing around with oobabooga text-generation-webui on Ubuntu 20.04 with my NVIDIA GTX 1060 6GB for some weeks without problems. I have been using llama2-chat models that share memory between my RAM and NVIDIA VRAM. I installed without…
imbr · 6,226 · 4 · 53 · 65
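The usual fix here is rebuilding the wheel with the cuBLAS backend, since the default PyPI install is CPU-only; a hedged sketch, with the model path as a placeholder and the classic LLAMA_CUBLAS build flag assumed.

    # Shell, run once: force a from-source CUDA (cuBLAS) build.
    #   CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
    #       pip install llama-cpp-python --force-reinstall --no-cache-dir
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-2-13b-chat.Q4_K_M.gguf",  # placeholder
        n_gpu_layers=32,  # offload as many layers as the 6 GB card holds
        verbose=True,     # a GPU build should report "BLAS = 1" in the load log
    )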
-2 votes · 0 answers

Cannot install the LlamaCpp module provided by LangChain

    n_gpu_layers = 32  # Change this value based on your model and your GPU VRAM pool.
    n_batch = 256  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    # Loading model
    llm = LlamaCpp( …
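A hedged completion of the snippet above, assuming the classic langchain.llms import path; the model path is a placeholder. Note that LangChain's LlamaCpp wrapper still needs the llama-cpp-python package installed (pip install llama-cpp-python).

    from langchain.llms import LlamaCpp

    n_gpu_layers = 32  # change based on your model and GPU VRAM pool
    n_batch = 256      # between 1 and n_ctx; consider available VRAM

    llm = LlamaCpp(
        model_path="llama-2-13b-chat.Q4_K_M.gguf",  # placeholder
        n_gpu_layers=n_gpu_layers,
        n_batch=n_batch,
        n_ctx=2048,
        verbose=True,
    )
    print(llm("Why is the sky blue?"))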