I looked through a lot of documentation but I am still confused about the retriever part.
I am building a chatbot over each user's custom data.
- The user feeds in their data
- The data is upserted to Pinecone
- Later, the user can chat with their data
- There can be multiple users, and each user should only be able to chat with their own data
I am following the approach below.
- Storing user data in Pinecone:
def doc_preprocessing(content):
    doc = Document(page_content=content)
    text_splitter = CharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=0
    )
    docs_split = text_splitter.split_documents([doc])
    return docs_split

def embedding_db(user_id, content):
    docs_split = doc_preprocessing(content)
    # Extract text from the split documents
    texts = [doc.page_content for doc in docs_split]
    vectors = embeddings.embed_documents(texts)
    # Store vectors with user_id as metadata
    for i, vector in enumerate(vectors):
        upsert_response = index.upsert(
            vectors=[
                {
                    'id': f"{user_id}",
                    'values': vector,
                    'metadata': {"user_id": str(user_id)}
                }
            ]
        )
This should create embeddings for the given data and store them in Pinecone.
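Two things stand out in the upsert loop above: every chunk is written under the same id (`f"{user_id}"`), so each upsert replaces the previous one, and the metadata holds no `text` field. Below is a minimal sketch of building the records differently, assuming one id per chunk and the chunk text kept in metadata under `text`. `build_records` and the `{user_id}-{i}` id scheme are hypothetical names chosen for illustration, not part of Pinecone or LangChain.

```python
from typing import Dict, List


def build_records(user_id: str, texts: List[str], vectors: List[List[float]]) -> List[Dict]:
    """Pair each chunk's text with its embedding in the dict format used for upserts."""
    if len(texts) != len(vectors):
        raise ValueError("texts and vectors must have the same length")
    records = []
    for i, (text, vector) in enumerate(zip(texts, vectors)):
        records.append({
            # Unique id per chunk so later chunks do not overwrite earlier ones.
            "id": f"{user_id}-{i}",
            "values": vector,
            # Keep the raw text so it can be read back at query time.
            "metadata": {"user_id": str(user_id), "text": text},
        })
    return records


# Assumed usage inside embedding_db, with `index` and `embeddings` as in the question:
#   records = build_records(user_id, texts, embeddings.embed_documents(texts))
#   index.upsert(vectors=records)
```

Note that with this id scheme a second upload for the same user would reuse the ids starting at `-0`; a per-document prefix or a UUID may be a better fit if users upload more than once.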
The second part is chatting with this data. For QA, I have the code below:
def retrieval_answer(user_id, query):
    text_field = "text"
    vectorstore = Pinecone(
        index, embeddings.embed_query, text_field
    )
    vectorstore.similarity_search(
        query,
        k=10,
        filter={
            "user_id": str(user_id)
        },
    )
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type='stuff',
        retriever=vectorstore.as_retriever(),
    )
    result = qa.run(query)
    print("Result:", result)
    return result
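In the function above, the result of `similarity_search` is discarded, and the retriever created by `as_retriever()` receives no filter, so the chain would search across all users' vectors. Below is a minimal sketch of passing the same filter to the retriever through `search_kwargs`. The helper `user_search_kwargs` is a hypothetical name introduced only for this example.

```python
from typing import Any, Dict


def user_search_kwargs(user_id: Any, k: int = 10) -> Dict[str, Any]:
    """Build search kwargs that restrict retrieval to one user's vectors."""
    return {"k": k, "filter": {"user_id": str(user_id)}}


# Assumed usage, with `vectorstore`, `llm` and RetrievalQA as in the question:
#   retriever = vectorstore.as_retriever(search_kwargs=user_search_kwargs(user_id))
#   qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
```

The filter only matches if `user_id` was stored as a string in the metadata at upsert time, which is what the upsert code in the question does with `str(user_id)`.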
But I keep getting:
Found document with no `text` key. Skipping.
When I run QA, it does not refer to the data stored in Pinecone; it just answers like plain ChatGPT. I am not sure what I am missing here.
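One likely explanation for the warning: the LangChain Pinecone wrapper appears to read each document's page content from the match metadata under the configured text key (`"text"` here) and to skip any match that lacks it. The snippet below is a simplified stand-in for that behavior, not the library's actual code. `docs_from_matches` and the sample match dictionaries are hypothetical.

```python
from typing import Dict, List, Tuple


def docs_from_matches(matches: List[Dict], text_key: str = "text") -> Tuple[List[str], List[str]]:
    """Return (texts found, ids skipped) for a list of query matches."""
    texts, skipped = [], []
    for match in matches:
        metadata = match.get("metadata") or {}
        if text_key in metadata:
            texts.append(metadata[text_key])
        else:
            # This is the case that would produce the "no `text` key" warning.
            skipped.append(match["id"])
    return texts, skipped


matches = [
    # Shape stored by the upsert code in the question: no text in metadata.
    {"id": "42", "metadata": {"user_id": "42"}},
    # Shape with the chunk text included under the text key.
    {"id": "42-0", "metadata": {"user_id": "42", "text": "hello"}},
]
texts, skipped = docs_from_matches(matches)
print(texts, skipped)
```

If every match is skipped this way, the chain gets no context at all, which would explain answers that look like plain ChatGPT.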