My idea is to extract the CLS token for every text in the DB and save it in a CSV or somewhere else. So when a new text comes in, instead of using cosine similarity / Jaccard / Manhattan / Euclidean or other distances, I want to use an approximation like LSH, ANN (Annoy, sklearn.neighbors), or the one given here: faiss. How can that be done? My code is:
PyTorch:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, I am a text")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden state is the first element of the output tuple
Using TensorFlow:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
and I think I can get the CLS token as follows (please correct me if I'm wrong):
last_hidden_states = outputs[0]
cls_embedding = last_hidden_states[0][0]  # [CLS] embedding of the first (only) sequence, shape (768,)
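For the whole DB, what I have in mind is roughly the sketch below (just a sketch, assuming a recent transformers version, a Python list texts holding the N database strings, and an arbitrary batch size of 32):

import numpy as np
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

all_cls = []
with torch.no_grad():
    for i in range(0, len(texts), 32):
        batch = texts[i:i + 32]
        enc = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
        out = model(**enc)
        # [CLS] is the first token of the last hidden state: shape (batch_size, 768)
        all_cls.append(out[0][:, 0, :].numpy())

embeddings = np.vstack(all_cls)  # shape (N, 768)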
Please tell me if this is the right way to do it, and how can I use any of LSH, Annoy, faiss, or something like that?
So for every text there will be a 768-length vector, and we can create an N x 768 matrix (N = number of texts, ~10M). How can I find the indices of the top-5 data points (texts) that are most similar to a given embedding/data point?
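For reference, this is roughly what I pieced together from the faiss docs (using the embeddings matrix and cls_embedding from the sketches above, and normalizing so that inner product becomes cosine similarity), but I don't know whether an exact flat index is reasonable at 10M vectors or whether I should be using an approximate index like IndexIVFFlat or IndexLSH instead:

import faiss

d = 768
xb = embeddings.astype('float32')   # faiss expects float32
faiss.normalize_L2(xb)              # in-place L2 normalization -> inner product == cosine
index = faiss.IndexFlatIP(d)        # exact inner-product index
index.add(xb)                       # add all N database vectors

query = cls_embedding.detach().numpy().astype('float32').reshape(1, -1)
faiss.normalize_L2(query)
D, I = index.search(query, 5)       # I[0] -> indices of the top-5 most similar texts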