I built a Q/A query bot over a 4 MB CSV file stored locally. I'm using Chroma for the vector DB, Instructor Large (hkunlp/instructor-large) from Hugging Face as the embedding model, and LlamaCPP with llama2-13b-chat as the chat LLM. The vector database came out to around 44 MB (also stored locally). After creating it I wired up the query Q/A bot, but responses are extremely slow: each answer takes around 30-40 minutes to generate, and from the second question onwards I also get the warning Llama.generate: prefix-match hit. I don't understand why it is so slow...
- Is there something wrong with the models?
- Or is it because of my PC's capabilities? I'd think my PC is more than capable of handling such a small dataset and these models (my CPU usage stays below 60% while a response is being generated).
- Am I doing anything wrong? I'm pretty new to this stuff.
My PC specifications: Processor: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz; RAM: 16 GB; System type: 64-bit OS, x64-based processor.
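For context, this is roughly how the vector DB was created in the first place (a sketch from memory: the CSV path, the SimpleCSVReader loader and the get_or_create_collection call are placeholders for what my actual creation script does, but the persist locations and collection name match the loading code below):

import chromadb
from llama_index import VectorStoreIndex, LangchainEmbedding, ServiceContext, StorageContext, download_loader
from llama_index.vector_stores import ChromaVectorStore
from langchain.embeddings import HuggingFaceEmbeddings

# read the 4 MB CSV into llama_index documents
SimpleCSVReader = download_loader("SimpleCSVReader")
documents = SimpleCSVReader().load_data(file="./data/movies.csv")  # placeholder path

# persistent Chroma collection that holds the embeddings
chroma_client = chromadb.PersistentClient(path="./storage/vector_storage/chromadb/")
collection = chroma_client.get_or_create_collection("csv_ecgi_db")

embed_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name="hkunlp/instructor-large"))
service_context = ServiceContext.from_defaults(embed_model=embed_model)
storage_context = StorageContext.from_defaults(vector_store=ChromaVectorStore(chroma_collection=collection))

# embed the documents into Chroma and persist the index metadata separately
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, service_context=service_context)
index.storage_context.persist(persist_dir="./storage/index_storage/ecgi/")

The querying script below then reloads everything from disk: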
from llama_index import load_index_from_storage
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.index_store import SimpleIndexStore
from llama_index import LangchainEmbedding, ServiceContext, StorageContext, download_loader, LLMPredictor
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import get_response_synthesizer
import chromadb
from chromadb.config import Settings
## create ChromaClient again
chroma_client = chromadb.PersistentClient(path="./storage/vector_storage/chromadb/")
# load the collection
collection = chroma_client.get_collection("csv_ecgi_db")
## construct storage context
load_storage_context = StorageContext.from_defaults(
    vector_store=ChromaVectorStore(chroma_collection=collection),
    index_store=SimpleIndexStore.from_persist_dir(persist_dir="./storage/index_storage/ecgi/"),
)
embedding_model_id = 'hkunlp/instructor-large'
embed_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name=embedding_model_id))
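## load the local llama2-13b-chat model through LlamaCPP
## (shown for completeness; the model path, context window and thread count
##  here are placeholders, not my exact values)
from llama_index.llms import LlamaCPP
llm = LlamaCPP(
    model_path="./models/llama-2-13b-chat.gguf",  # placeholder path to the local model file
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    model_kwargs={"n_threads": 8},  # passed through to llama-cpp-python
    verbose=True,
)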
## construct service context
load_service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
## finally, load the index
load_index = load_index_from_storage(
    storage_context=load_storage_context,
    service_context=load_service_context,
)
# configure response synthesizer
response_synthesizer = get_response_synthesizer(
    response_mode='compact',
    service_context=load_service_context,
)
# build a retriever over the loaded index (similarity_top_k left at the default)
retriever = VectorIndexRetriever(index=load_index, similarity_top_k=2)
# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)
# query
response = query_engine.query("what were the danish Horror movies in february of 2023?")
response
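# (the 30-40 minutes per answer mentioned above is plain wall-clock time
#  around the query call, measured roughly like this)
import time
start = time.time()
response = query_engine.query("what were the danish Horror movies in february of 2023?")
print(f"took {time.time() - start:.1f} s")
print(response)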
I looked around on GitHub and found people discussing the same issue, but no conclusion was reached; their response times were similar to mine. I was expecting it to respond within seconds, the way ChatGPT does.