8

I have written LangChain code that uses Chroma DB as a vector store for data from a website URL. It currently fetches the data from the URL, stores it in the project folder, and then uses that data to respond to a user prompt. I figured out how to make the data persist after the run, but I can't figure out how to load it for future prompts. The goal is that when user input is received, the program uses the OpenAI LLM to generate a response based on the existing database files, rather than having to create/write those files on every run.

What should I do?

I tried this, as it seemed like the ideal solution:

vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=vectordb)

But from_chain_type() doesn't accept a vector store as an argument, so this doesn't work.

Destroy666
max choate

4 Answers

9

You need to define a retriever and pass that to the chain. The retriever will then use your previously persisted DB for queries.

# reload the persisted DB; no documents are re-embedded here
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)

# expose the store as a retriever for the chain
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
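
For context, here is a minimal end-to-end sketch of that pattern. It assumes OpenAI embeddings were used when the index was built (swap in your own embedding model if not), and "./chroma_db" and the question string are placeholders:

from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

# must match the embedding model used when the DB was created
embeddings = OpenAIEmbeddings()

# load the previously persisted index; nothing is re-ingested here
vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
retriever = vectordb.as_retriever()

llm = ChatOpenAI(temperature=0)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

print(qa.run("YOUR QUESTION"))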

Andrew
1

I have also tried to load the Chroma vector store, but my code won't load the DB from disk. Here is what I did:

from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFDirectoryLoader
import os
import json

def load_api_key(secrets_file="secrets.json"):
    with open(secrets_file) as f:
        secrets = json.load(f)
    return secrets["OPENAI_API_KEY"]

# Setup
api_key = load_api_key()
os.environ["OPENAI_API_KEY"] = api_key

# load the document and split it into chunks
loader = PyPDFDirectoryLoader("LINK TO FOLDER WITH PDF")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load docs into an in-memory Chroma DB
db = Chroma.from_documents(docs, embedding_function)

# query the DB (use a new variable so `docs` still holds the full split documents)
query = "MY QUERY"
results = db.similarity_search(query)

# print results
print(results[0].page_content)

# save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")

So far no problems! Then when I load the DB with this code:

# load from disk
db3 = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
db3.get() 
docs = db3.similarity_search(query)
print(docs[0].page_content)

The db3.get() already shows that there is no data in db3. It returns:

{'ids': [], 'embeddings': None, 'documents': [], 'metadatas': []}

Any ideas why this could be?

Heka
  • When I use FAISS instead of Chroma as a vector store it works. Simply replace the respective lines with `db = FAISS.from_documents(docs, embedding_function)`, `db.save_local("faiss_index")` and `db3 = FAISS.load_local("faiss_index", embedding_function)`. – Heka Aug 03 '23 at 14:27
  • I didn't find the right solution but get one workaound for me: to export db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db") in the place you need it. (db2.similarity_search(query) ) – s00103898-276165-15433 Aug 14 '23 at 20:18
0

I just found that the following works:

# Assumes chromadb_client is a Chroma client pointed at the persisted data
# (e.g. created with chromadb.PersistentClient) and langchain_embedding_function
# is the same embedding function used when the collection was written.
def fetch_embeddings(collection_name):
    collection = chromadb_client.get_collection(
        name=collection_name, embedding_function=langchain_embedding_function
    )
    # get() returns a dict; "ids" come back alongside the requested "embeddings"
    embeddings = collection.get(include=["embeddings"])

    print(collection.get(include=["embeddings", "documents", "metadatas"]))

    return embeddings
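
For example, assuming a collection named "my_collection" exists in the persisted DB (the name is a placeholder):

result = fetch_embeddings("my_collection")
print(len(result["ids"]), "records returned")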

reference: https://docs.trychroma.com/usage-guide

0

Chroma provides get_collection at

https://docs.trychroma.com/reference/Client#get_collection

Here's an example of my code to query an existing vectorStore >

def get(embedding_function):
    db = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
    data = db.get()  # fetch everything stored in the collection
    print(data.keys())
    print(len(data["ids"]))

Example output, with 7580 chunks stored >

Using embedded DuckDB with persistence: data will be stored in: ./chroma_db
dict_keys(['ids', 'embeddings', 'documents', 'metadatas'])
7580
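
Once get() confirms the store is populated, the same handle can serve queries. A minimal sketch, assuming the embedding function matches the one used at write time:

def query(embedding_function, question, k=4):
    db = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
    # return the k chunks most similar to the question
    return db.similarity_search(question, k=k)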
j3ffyang