
I am using Faiss to index the embeddings of a huge dataset, where the embeddings are generated with a BERT model. I want to add the embeddings incrementally. This works fine as long as I use faiss.IndexFlatL2, but the index is far too large when I save it. So I tried faiss.IndexIVFPQ, which has to be trained before any data can be added, so I cannot add incrementally: I have to compute all embeddings first, then train and add them, which is a problem because all of the data has to be kept in RAM until I write the index. Is there any way to do this incrementally? Here is my code:

    import faiss
    import numpy as np
    from tqdm import tqdm

    # This works fine when using IndexFlatL2
    def __init__(self, sentences, model):
        self.sentences = sentences
        self.model = model
        self.index = faiss.IndexFlatL2(768)

    def process_sentences(self):
        result = self.model(self.sentences)
        self.sentence_ids = []
        self.token_ids = []
        self.all_tokens = []
        for i, (toks, embs) in enumerate(tqdm(result)):
            # initialize all_embeddings for every new sentence (INCREMENTALLY)
            all_embeddings = []
            for j, (tok, emb) in enumerate(zip(toks, embs)):
                self.sentence_ids.append(i)
                self.token_ids.append(j)
                self.all_tokens.append(tok)
                all_embeddings.append(emb)

            all_embeddings = np.stack(all_embeddings) # Add embeddings after every sentence
            self.index.add(all_embeddings)

        faiss.write_index(self.index, "faiss_Model")

And when using IndexIVFPQ:

    def __init__(self, sentences, model):
        self.sentences = sentences
        self.model = model
        self.quantizer = faiss.IndexFlatL2(768)
        self.index = faiss.IndexIVFPQ(self.quantizer, 768, 1000, 16, 8)

    def process_sentences(self):
        result = self.model(self.sentences)
        self.sentence_ids = []
        self.token_ids = []
        self.all_tokens = []
        all_embeddings = []
        for i, (toks, embs) in enumerate(tqdm(result)):
            for j, (tok, emb) in enumerate(zip(toks, embs)):
                self.sentence_ids.append(i)
                self.token_ids.append(j)
                self.all_tokens.append(tok)
                all_embeddings.append(emb)

        all_embeddings = np.stack(all_embeddings)
        self.index.train(all_embeddings) # Train
        self.index.add(all_embeddings) # Add to index
        faiss.write_index(self.index, "faiss_Model_mini")
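
For scale: IndexFlatL2 stores every vector as raw float32, i.e. roughly 768 × 4 ≈ 3 KB per token embedding, so millions of token vectors quickly add up to gigabytes on disk. IndexIVFPQ with m = 16 and nbits = 8 compresses each vector to about 16 bytes of PQ codes (plus id overhead), but it has to be trained first to learn the coarse centroids and the PQ codebooks.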
  • Is there anything stopping you from adding as many vectors as you can before hitting memory limits, training your index, then iteratively loading and adding the remaining vectors *without* training? I know ideally it would be better to train on the full dataset but I can't think of a way to load all and train on all if the system can't handle the size needed. – James Briggs Nov 12 '21 at 08:09
  • Does it affect the results if I train on as much data as memory allows and then add the rest without training? I am new to faiss! – DevPy Nov 12 '21 at 10:41
  • Ideally you would train with the full dataset, but as with your case it's not always realistic - so try and train with a sample that is representative of the (expected) full dataset, then add your data after. The training aspect of PQ focuses on creating clusters (I wrote about it [here](https://www.pinecone.io/learn/product-quantization/)) and as long as your new vectors fall into these clusters nicely, there's no problem. – James Briggs Nov 12 '21 at 14:35
  • Ok thanks @JamesBriggs! Also I want to know whether I can train the index multiple times, i.e. train and add in batches: 15k sentences, train and add, then another 15k sentences, train and add, and so on. Would that work, or is only a one-time training needed? – DevPy Nov 15 '21 at 04:30
  • I could be wrong but I believe training is a one time thing, to retrain an index you create a new index and train it - but that wouldn't help in your case – James Briggs Nov 15 '21 at 10:37
  • 2
    oh okay.. So can I train it one time with as much as data I can and then simply next time load the trained index using faiss.read_index(file) and add more indexes to it?? – DevPy Nov 15 '21 at 13:14
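
Putting the comments above together, here is a minimal sketch of that approach, assuming you train once on a representative sample that fits in RAM and then only add afterwards (the random arrays, batch loop, and file name below are placeholders, not part of the real pipeline):

    import numpy as np
    import faiss

    d = 768            # embedding dimension
    nlist = 1000       # number of IVF clusters
    m, nbits = 16, 8   # PQ: 16 sub-quantizers, 8 bits each

    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

    # 1) Train once on a sample that fits in RAM (random vectors here stand in
    #    for a representative sample of the BERT embeddings).
    train_sample = np.random.random((50_000, d)).astype("float32")
    index.train(train_sample)
    assert index.is_trained

    # 2) Add the rest batch by batch; no further training is needed.
    for _ in range(10):  # stand-in for looping over sentence batches
        batch = np.random.random((1_000, d)).astype("float32")
        index.add(batch)

    faiss.write_index(index, "faiss_Model_mini")

    # 3) Later (even in another run): reload the trained index and keep adding.
    index = faiss.read_index("faiss_Model_mini")
    new_batch = np.random.random((1_000, d)).astype("float32")
    index.add(new_batch)
    faiss.write_index(index, "faiss_Model_mini")

Because add() never changes the trained codebooks, search quality depends on how representative the training sample is of the batches added later, as noted in the comments above.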

0 Answers