
I created two dbs like this (same embeddings) using langchain 0.0.143:

db1 = Chroma.from_documents(
    documents=texts1,
    embedding=embeddings, 
    persist_directory=persist_directory1,
)
db1.persist()

db2 = Chroma.from_documents(
    documents=texts2,
    embedding=embeddings, 
    persist_directory=persist_directory2,
)
db2.persist()

Later, I load them with:

db1 = Chroma(
    persist_directory=persist_directory1,
    embedding_function=embeddings,
)

db2 = Chroma(
    persist_directory=persist_directory2,
    embedding_function=embeddings,
)

How do I combine db1 and db2? I want to use the combined store in a ConversationalRetrievalChain by setting retriever=db.as_retriever().

I tried a couple of suggestions I found by searching, but I am missing something obvious.

Progman
randomQs

3 Answers


The simpler option is to load both sets of documents into the same Chroma object. Each chunk keeps its own metadata, so you can still tell which document each embedding came from:

from langchain.vectorstores import Chroma

chroma_directory = 'db/'

# `embeddings`, `texts1` and `texts2` are the objects from the question.
db = Chroma(persist_directory=chroma_directory, embedding_function=embeddings)

# Both calls embed their documents and store them in the same collection.
db.add_documents(documents=texts1)
db.add_documents(documents=texts2)

db.similarity_search_with_score(query="Introduction to the document")
# --> results from both documents

The more complicated option: by default, Chroma persists its data as two parquet files plus an index. If you could guarantee that the indexes do not conflict, you could in theory merge the corresponding parquet files, then merge the two index/ folders by copying the contents of each into a new index/ folder next to the two merged parquet files.

Simon Podhajsky

Building on Jordy's answer, this is how I ended up doing it without recomputing the embeddings every time:

db1 = Chroma(
    persist_directory=persist_directory1,
    embedding_function=embeddings,
)

db2 = Chroma(
    persist_directory=persist_directory2,
    embedding_function=embeddings,
)

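# Read everything from db2, including the stored embeddings.
db2_data = db2._collection.get(include=['documents', 'metadatas', 'embeddings'])

# Copy the records into db1, reusing the existing embeddings and ids.
db1._collection.add(
    embeddings=db2_data['embeddings'],
    metadatas=db2_data['metadatas'],
    documents=db2_data['documents'],
    ids=db2_data['ids'],
)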

The default get() on LangChain's Chroma wrapper does not return embeddings, so you need to call get() on the underlying chromadb collection and ask for the embeddings explicitly.
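If the two stores might share ids, it may be worth dropping the overlapping records before the add() call, since how Chroma handles ids that already exist can depend on the version (this is an assumption, not something tested against 0.0.143). A minimal sketch, assuming the data is a dict of parallel lists shaped like the result of collection.get(); the helper name drop_existing_ids is hypothetical:

```python
from typing import Any, Dict, List

# Keys passed on to collection.add(); other keys in the input are ignored.
FIELDS = ("ids", "embeddings", "metadatas", "documents")


def drop_existing_ids(
    source: Dict[str, List[Any]], existing_ids: List[str]
) -> Dict[str, List[Any]]:
    """Return a copy of `source` without the records whose id is in `existing_ids`.

    `source` is assumed to hold parallel lists under the keys in FIELDS.
    """
    seen = set(existing_ids)
    # Positions of the records that are not yet present in the target.
    keep = [i for i, rec_id in enumerate(source["ids"]) if rec_id not in seen]
    return {field: [source[field][i] for i in keep] for field in FIELDS}
```

With the objects above, this could be used as filtered = drop_existing_ids(db2_data, db1._collection.get()['ids']), followed by db1._collection.add(**filtered).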

Mike Feng

Another option would be to add the items from one Chroma db into the other Chroma db like so:


db1 = Chroma(
    persist_directory=persist_directory1,
    embedding_function=embeddings,
)

db2 = Chroma(
    persist_directory=persist_directory2,
    embedding_function=embeddings,
)

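# Fetch db2's records once, explicitly requesting the embeddings
# (a plain db2.get() does not return them).
db2_data = db2._collection.get(include=['documents', 'metadatas', 'embeddings'])

# A single add() call can take up to 100k+ items.
db1._collection.add(
    embeddings=db2_data['embeddings'],
    metadatas=db2_data['metadatas'],
    documents=db2_data['documents'],
    ids=db2_data['ids'],
)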

Note that the documentation mentions adding "up to 100k+" items at once, so there is a limit on how much you can add to the collection in a single call.

Source: https://docs.trychroma.com/api-reference#methods-related-to-collections
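If db2 is larger than a single add() call can take, the records can be added in chunks instead. A rough sketch, assuming the data is a dict of parallel lists shaped like the result of collection.get(); the helper names and the default batch size of 10,000 are hypothetical, not values from the Chroma documentation:

```python
from typing import Any, Callable, Dict, Iterator, List

# Keys passed on to collection.add(); other keys in the input are ignored.
FIELDS = ("ids", "embeddings", "metadatas", "documents")


def iter_batches(
    data: Dict[str, List[Any]], batch_size: int
) -> Iterator[Dict[str, List[Any]]]:
    """Yield slices of the parallel lists in `data`, `batch_size` records at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total = len(data["ids"])
    for start in range(0, total, batch_size):
        end = start + batch_size
        # Skip fields that are missing or None (e.g. embeddings not requested).
        yield {
            field: data[field][start:end]
            for field in FIELDS
            if data.get(field) is not None
        }


def add_in_batches(
    add: Callable[..., Any], data: Dict[str, List[Any]], batch_size: int = 10_000
) -> int:
    """Call `add(**batch)` once per batch and return the number of calls made."""
    calls = 0
    for batch in iter_batches(data, batch_size):
        add(**batch)
        calls += 1
    return calls
```

Here add would be db1._collection.add and data the dict returned by db2._collection.get(include=['documents', 'metadatas', 'embeddings']).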

Note: this method merges the source data (db2) into the target collection (db1). If db1 has a collection named 'db1_collection' and db2 has one named 'db2_collection', the combined data ends up only in 'db1_collection'.

Jordy