I am using Faiss to index my huge dataset embeddings, embedding generated from bert model. I want to add the embeddings incrementally, it is working fine if I only add it with faiss.IndexFlatL2 , but the problem is while saving it the size of it is too large. So I tried with faiss.IndexIVFPQ, but it needs to train embeddings before I add the data, so I can not add it incrementally, I have to compute all embeddings first and then train and add it, it is having issue because all the data should be kept in RAM till I write it. Is there any way to do this incrementally. Here is my code:
# It is working fine when using with IndexFlatL2
def __init__(self, sentences, model):
self.sentences = sentences
self.model = model
self.index = faiss.IndexFlatL2(768)
def process_sentences(self):
result = self.model(self.sentences)
self.sentence_ids = []
self.token_ids = []
self.all_tokens = []
for i, (toks, embs) in enumerate(tqdm(result)):
# initialize all_embeddings for every new sentence (INCREMENTALLY)
all_embeddings = []
for j, (tok, emb) in enumerate(zip(toks, embs)):
self.sentence_ids.append(i)
self.token_ids.append(j)
self.all_tokens.append(tok)
all_embeddings.append(emb)
all_embeddings = np.stack(all_embeddings) # Add embeddings after every sentence
self.index.add(all_embeddings)
faiss.write_index(self.index, "faiss_Model")
ANd when using with IndexIVFPQ:
def __init__(self, sentences, model):
self.sentences = sentences
self.model = model
self.quantizer = faiss.IndexFlatL2(768)
self.index = faiss.IndexIVFPQ(self.quantizer, 768, 1000, 16, 8)
def process_sentences(self):
result = self.model(self.sentences)
self.sentence_ids = []
self.token_ids = []
self.all_tokens = []
all_embeddings = []
for i, (toks, embs) in enumerate(tqdm(result)):
for j, (tok, emb) in enumerate(zip(toks, embs)):
self.sentence_ids.append(i)
self.token_ids.append(j)
self.all_tokens.append(tok)
all_embeddings.append(emb)
all_embeddings = np.stack(all_embeddings)
self.index.train(all_embeddings) # Train
self.index.add(all_embeddings) # Add to index
faiss.write_index(self.index, "faiss_Model_mini")