I'm trying to run a few documents through OpenAI's text embedding API and insert the resulting embeddings, along with the text, into a local Chroma database.

from langchain.vectorstores import Chroma

sales_data = medium_data_split + yt_data_split
sales_store = Chroma.from_documents(
    sales_data, embeddings, collection_name="sales"
)

This fails with RateLimitError: Rate limit reached for default-text-embedding-ada-002 from the OpenAI API, since I'm using my personal account.

As a workaround, I want to split the large medium_data_split list into smaller batches and run each batch through the OpenAI embedding API in a loop, pausing for a minute between batches.
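
Roughly, the loop I have in mind looks like this (just a sketch; the batch size of 25 and the 60-second pause are arbitrary values I picked):

import time

batch_size = 25  # arbitrary; small enough to stay under the rate limit
batch_stores = []

for i in range(0, len(medium_data_split), batch_size):
    batch = medium_data_split[i:i + batch_size]
    # embed one batch, then wait a minute before the next API call
    batch_stores.append(
        Chroma.from_documents(batch, embeddings, collection_name="sales")
    )
    time.sleep(60)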

To make this work I need to join/combine the resulting per-batch Chroma databases, but I haven't found a way to do that yet. Can someone suggest one?

Here is what I tried:

sales_data1 = yt_data_split
sales_store1 = Chroma.from_documents(
    sales_data1, embeddings, collection_name="sales"
)

sales_data2 = medium_data_split[0:25]
sales_store2 = Chroma.from_documents(
    sales_data2, embeddings, collection_name="sales"
)

sales_store_concat = sales_store1.add(sales_store2)

I get the following error: AttributeError: 'Chroma' object has no attribute 'add'

1 Answer

One solution would be to use a text splitter to split the documents into multiple chunks and store them on disk.

Split the documents into chunks:

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

You can also use open-source embeddings such as SentenceTransformerEmbeddings to create the embeddings, which sidesteps the OpenAI rate limit entirely.

Create the open-source embedding function:

from langchain.embeddings import SentenceTransformerEmbeddings

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

Then you can run a similarity search over these embeddings and, based on the similarity scores, select only the most relevant chunks to send to OpenAI.

Save the store to disk and query it:

# build the store with the open-source embeddings and persist it to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()

# returns (document, score) tuples for the query
docs_and_scores = db2.similarity_search_with_score("search query", k=2)
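
Once persisted, the store can be reloaded from disk later, and the scores can be used to keep only the most relevant chunks before anything is sent to OpenAI. A rough sketch (with default settings Chroma returns distances, so lower means more similar; the 0.5 cutoff is an arbitrary placeholder):

# reload the persisted store from disk
db3 = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)

docs_and_scores = db3.similarity_search_with_score("search query", k=5)

# keep only the chunks within an (arbitrary) distance threshold
relevant_chunks = [doc for doc, score in docs_and_scores if score < 0.5]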

There is also an option to merge vector stores; the FAISS vector store supports this directly. Please check: https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/faiss
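
With FAISS the merge looks roughly like this (a sketch assuming the faiss package is installed; FAISS exposes a merge_from method, which the Chroma wrapper does not):

from langchain.vectorstores import FAISS

# build a separate FAISS index per batch, then merge them
faiss_store1 = FAISS.from_documents(yt_data_split, embedding_function)
faiss_store2 = FAISS.from_documents(medium_data_split[0:25], embedding_function)

# merge_from folds faiss_store2's vectors and documents into faiss_store1 in place
faiss_store1.merge_from(faiss_store2)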

References:

https://python.langchain.com/docs/modules/data_connection/document_transformers/

https://python.langchain.com/docs/modules/data_connection/text_embedding/integrations/sentence_transformers

dassum