
I am using LangChain to read data from a PDF and split it into chunks of text. I then embed the chunks into vectors and load them into a Pinecone vector store, but I am getting a MaxRetryError.

I suspect that loading all the chunks at once is causing the issue. Is there a function like add_documents that can be used to load the chunks one batch at a time?

def load_document(file):
    from langchain.document_loaders import PyPDFLoader
    print(f'Loading {file} ...')
    loader = PyPDFLoader(file)
    # load() returns a list of LangChain documents, one document per page
    data = loader.load()
    return data


data = load_document("DATA/capacitance.pdf")
# print the content of the second page and the metadata of the third page
print(data[1].page_content)
print(data[2].metadata)

#chunking
def chunk_data(data, chunk_size=256):
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    # split each page into non-overlapping chunks of roughly chunk_size characters
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    chunks = text_splitter.split_documents(data)
    print(type(chunks))
    return chunks

chunks = chunk_data(data)
print(len(chunks))

Up to the chunking step my code works well: it loads the PDF, converts it to text, and chunks the data. For the embedding step I tried both Pinecone and FAISS. For Pinecone I have already created an index named 'electrostatics':

pinecone.create_index('electrostatics', dimension=1536, metric='cosine')

import os
from dotenv import load_dotenv, find_dotenv

load_dotenv("D:/test/.env")
print(os.environ.get("OPENAI_API_KEY"))

def insert_embeddings(index_name, chunks):
    import pinecone
    from langchain.vectorstores import Pinecone
    from langchain.embeddings.openai import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings()
    pinecone.init(api_key=os.environ.get("PINECONE_API_KEY"), environment=os.environ.get("PINECONE_ENV"))
    # embeds all chunks and upserts them into the index in a single call
    vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)
    print("Ok")
    return vector_store

I tried inserting the embeddings in the following ways:

index_name = 'electrostatics'
vector_store = insert_embeddings(index_name, chunks)

With FAISS:

from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(chunks, embeddings)

(screenshot of the MaxRetryError traceback)
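Both attempts go through OpenAIEmbeddings, so a minimal check on a single embedding call (a sketch using embed_query, which is not in the code above) would show whether the failure happens on the OpenAI side before any data reaches Pinecone or FAISS:

    # Minimal check: embed a single string to see whether the OpenAI call itself fails
    from langchain.embeddings.openai import OpenAIEmbeddings

    try:
        vector = OpenAIEmbeddings().embed_query("test sentence")
        print(f"Embedding call succeeded, vector length: {len(vector)}")
    except Exception as e:
        print(f"Embedding call failed: {e}")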

    You have simply hit your OpenAI API quota. – Alberto Jul 11 '23 at 08:13
  • As above, definitely an OpenAI API quota issue. Here are some additional helpful threads: https://community.openai.com/t/i-am-getting-ratelimiterror/124977 and https://community.openai.com/t/rate-limit-error/14769 – Olney1 Jul 11 '23 at 10:30
  • Is it because the chunk size is high? Should I split it and try? – slaveCoder Jul 11 '23 at 11:56
  • You have exhausted your monthly limit for the API, i.e. you have consumed all the credits allocated to your plan. [More on this](https://help.openai.com/en/articles/6891831-error-code-429-you-exceeded-your-current-quota-please-check-your-plan-and-billing-details). You would need to upgrade to a new plan. – InsertCheesyLine Jul 11 '23 at 12:50

1 Answer


This error typically happens when there are connection issues or timeouts. I think it is better to insert the data in smaller batches, like this:

def insert_embeddings(index_name, chunks):
    import pinecone
    from langchain.vectorstores import Pinecone
    from langchain.embeddings.openai import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings()
    pinecone.init(api_key=os.environ.get("PINECONE_API_KEY"), environment=os.environ.get("PINECONE_ENV"))

    # Connect to the existing index instead of embedding everything in one call
    vector_store = Pinecone.from_existing_index(index_name, embeddings)

    # Insert the chunks into the vector store in batches
    batch_size = 100  # Define your preferred batch size
    for i in range(0, len(chunks), batch_size):
        chunk_batch = chunks[i:i + batch_size]
        vector_store.add_documents(chunk_batch)

    print("Ok")
    return vector_store