I am using LangChain to read data from a PDF and split it into chunks of text. I then embed the chunks into vectors and load them into a vector store using Pinecone. I am getting a MaxRetryError.
I suspect the problem is that I am loading all the chunks at once. Is there some function, like add_documents, that can be used to insert the chunks one by one (or in smaller batches) instead of all at once?
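What I have in mind is roughly this (only a sketch, I am not sure add_documents is the right method, and the batch size and the pause are guesses):

import time
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings

def insert_in_batches(index_name, chunks, batch_size=100):
    # assumes pinecone.init(...) has already been called
    embeddings = OpenAIEmbeddings()
    # create the store from the first batch, then append the rest batch by batch
    vector_store = Pinecone.from_documents(chunks[:batch_size], embeddings, index_name=index_name)
    for i in range(batch_size, len(chunks), batch_size):
        vector_store.add_documents(chunks[i:i + batch_size])
        time.sleep(1)  # small pause between batches (guess, to avoid hammering the API)
    return vector_store

My current code is below.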
def load_document(file):
    from langchain.document_loaders import PyPDFLoader
    print(f'Loading {file} ...')
    loader = PyPDFLoader(file)
    # load() returns a list of LangChain Documents, one per page
    data = loader.load()
    return data
data = load_document("DATA/capacitance.pdf")
# prints the content of the second page (data is 0-indexed)
print(data[1].page_content)
# prints the metadata of the third page
print(data[2].metadata)
# chunking
def chunk_data(data, chunk_size=256):
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    chunks = text_splitter.split_documents(data)
    print(type(chunks))
    return chunks

chunks = chunk_data(data)
print(len(chunks))
Up to the chunking step my code works well: it loads the PDF, converts it to text, and chunks the data. The problem starts at the embedding step, where I tried both Pinecone and FAISS. For Pinecone I had already created an index named 'electrostatics':
pinecone.create_index('electrostatics',dimension=1536,metric='cosine')
import os
from dotenv import load_dotenv,find_dotenv
load_dotenv("D:/test/.env")
print(os.environ.get("OPENAI_API_KEY"))
def insert_embeddings(index_name, chunks):
    import pinecone
    from langchain.vectorstores import Pinecone
    from langchain.embeddings.openai import OpenAIEmbeddings
    embeddings = OpenAIEmbeddings()
    pinecone.init(api_key=os.environ.get("PINECONE_API_KEY"), environment=os.environ.get("PINECONE_ENV"))
    vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)
    print("Ok")
    return vector_store
I tried inserting the embeddings in the following ways.

With Pinecone:
index_name = 'electrostatics'
vector_store = insert_embeddings(index_name, chunks)
With FAISS:
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(chunks, embeddings)
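If batching really is the way around the retry error, I imagine the FAISS version would look roughly like this (again just a sketch, with a guessed batch size):

from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
batch_size = 100  # guess
# build the index from the first batch, then add the remaining batches
db = FAISS.from_documents(chunks[:batch_size], embeddings)
for i in range(batch_size, len(chunks), batch_size):
    db.add_documents(chunks[i:i + batch_size])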