How to combine two Chroma databases

Question

I created two dbs like this (same embeddings) using langchain 0.0.143:

db1 = Chroma.from_documents(
    documents=texts1,
    embedding=embeddings, 
    persist_directory=persist_directory1,
)
db1.persist()

db2 = Chroma.from_documents(
    documents=texts2,
    embedding=embeddings, 
    persist_directory=persist_directory2,
)
db2.persist()

then later accessing them with

db1 = Chroma(
    persist_directory=persist_directory1,
    embedding_function=embeddings,
)

db2 = Chroma(
    persist_directory=persist_directory2,
    embedding_function=embeddings,
)

How do I combine db1 and db2? I want to use them in a ConversationalRetrievalChain setting retriever=db.as_retriever().

I tried a couple of suggestions I found by searching, but I am missing something obvious.

score 3 · Answer 1 · answered Apr 30 '23 at 10:14

The simpler option is to load both sets of documents into the same Chroma object. They'll retain separate metadata, so you can still tell which document each embedding came from:

from langchain.vectorstores import Chroma

chroma_directory = 'db/'

# `embeddings` is the same embedding object used in the question
db = Chroma(persist_directory=chroma_directory, embedding_function=embeddings)

db.add_documents(documents=texts1)
db.add_documents(documents=texts2)

db.similarity_search_with_score(query="Introduction to the document")
# --> results from both documents
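
If the loaders did not already record where each chunk came from, one way to keep the two sources distinguishable is to tag the chunks before adding them. This is a minimal sketch, assuming texts1 and texts2 are LangChain Document objects (anything with a mutable .metadata dict would work). The helper name tag_origin and the "origin" key are made up for illustration and are not part of LangChain or Chroma:

```python
def tag_origin(docs, origin, key="origin"):
    """Record `origin` under metadata[key] on every document-like object.

    `docs` is any iterable of objects with a mutable `.metadata` dict
    (assumed here to be LangChain Document objects). The documents are
    modified in place and returned as a list for convenience.
    """
    tagged = []
    for doc in docs:
        # A separate key is used so an existing "source" entry set by a
        # loader is left untouched.
        doc.metadata[key] = origin
        tagged.append(doc)
    return tagged
```

Usage would then look like db.add_documents(documents=tag_origin(texts1, "doc1")), and each search result carries the tag as doc.metadata["origin"].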

The more complicated option: default Chroma storage is two parquet files and an index. If you could guarantee no index conflicts, you could theoretically merge the respective parquet files and merge the two index/ folders by copying the content of each into a new index/ folder adjacent to the two new parquet files.

Thanks!! However, this requires one to remake the embeddings every time a new document is needed, right? I'd like to avoid that cost. — randomQs, Jun 22 '23 at 13:31
@randomQs did you try this/figure out if this is the case, does this work? — Alexander Hristov, Jul 11 '23 at 10:42

score 1 · Answer 2 · answered Jul 26 '23 at 15:05

Building on Jordy's answer, this is how I ended up doing it without rebuilding embeddings every time:

db1 = Chroma(
    persist_directory=persist_directory1,
    embedding_function=embeddings,
)

db2 = Chroma(
    persist_directory=persist_directory2,
    embedding_function=embeddings,
)

db2_data=db2._collection.get(include=['documents','metadatas','embeddings'])
db1._collection.add(
     embeddings=db2_data['embeddings'],
     metadatas=db2_data['metadatas'],
     documents=db2_data['documents'],
     ids=db2_data['ids']
)

LangChain's Chroma wrapper does not include embeddings in its default get(), so you have to call get() on the underlying chromadb collection and ask for the embeddings explicitly.

Jordy · Answer 3 · 2023-06-08T11:46:06.140

Another option would be to add the items from one Chroma db into the other Chroma db like so:


db1 = Chroma(
    persist_directory=persist_directory1,
    embedding_function=embeddings,
)

db2 = Chroma(
    persist_directory=persist_directory2,
    embedding_function=embeddings,
)

#can add collections up to 100K+
db1._collection.add(
     embeddings=db2.get()['embeddings'],
     metadatas=db2.get()['metadatas'],
     documents=db2.get()['documents'],
     ids=db2.get()['ids']
)

Note that the documentation suggests adding up to 100k+ items at once, so there is a limit on how much you can add to the collection in a single call.

Source: https://docs.trychroma.com/api-reference#methods-related-to-collections

Note: this method merges the source data (db2) into the target collection (db1). If db1 has a collection named 'db1_collection' and db2 has a collection named 'db2_collection', only 'db1_collection' remains afterwards.
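
Given the per-call size limit mentioned above, a large source collection could be copied in batches instead of in one add() call. This is only a sketch, not a tested recipe: copy_collection and batch_size are hypothetical names, the default of 5000 is arbitrary, and it assumes the chromadb collection methods count(), get(include=..., limit=..., offset=...) and add(...) behave as documented and that paging with offset returns records in a stable order. It asks for the stored embeddings explicitly so they are reused rather than recomputed:

```python
def copy_collection(src, dst, batch_size=5000):
    """Copy every record from `src` into `dst` in batches.

    `src` and `dst` are assumed to be chromadb collection objects,
    e.g. db2._collection and db1._collection. Returns the number of
    records copied.
    """
    copied = 0
    total = src.count()
    for offset in range(0, total, batch_size):
        # Request embeddings explicitly so they are not recomputed.
        batch = src.get(
            include=["documents", "metadatas", "embeddings"],
            limit=batch_size,
            offset=offset,
        )
        if not batch["ids"]:
            break  # nothing left to copy
        dst.add(
            ids=batch["ids"],
            embeddings=batch["embeddings"],
            metadatas=batch["metadatas"],
            documents=batch["documents"],
        )
        copied += len(batch["ids"])
    return copied
```

It would be called as copy_collection(db2._collection, db1._collection), followed by db1.persist() if your version requires it.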

This actually makes new API calls for embeddings. — Alexander Hristov, Jul 11 '23 at 08:21
