My idea is to extract the CLS token for every text in the DB and save it in a CSV or somewhere else. So when a new text comes in, instead of using cosine similarity / Jaccard / Manhattan / Euclidean or other distances, I want to use an approximation like LSH, ANN (Annoy, sklearn.neighbors), or the one given here: faiss. How can that be done? My code is:
PyTorch:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, I am a text")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden state is the first element of the output tuple
Using TensorFlow:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
and I think I can get the CLS token as follows (please correct me if I'm wrong):
last_hidden_states = outputs[0]
cls_embedding = last_hidden_states[0][0]  # [CLS] embedding of the first (only) sequence, shape (768,)
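For the whole DB, what I have in mind is roughly the sketch below (just a sketch, assuming a recent transformers version, a Python list texts holding the N database strings, and an arbitrary batch size of 32):

import numpy as np
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

all_cls = []
with torch.no_grad():
    for i in range(0, len(texts), 32):
        batch = texts[i:i + 32]
        enc = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
        out = model(**enc)
        # [CLS] is the first token of the last hidden state: shape (batch_size, 768)
        all_cls.append(out[0][:, 0, :].numpy())

embeddings = np.vstack(all_cls)  # shape (N, 768)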
Please tell me if this is the right way to do it, and how can I use any of LSH, Annoy, faiss, or something like that?
So for every text there will be a 768-length vector, and we can create an N x 768 matrix (N = number of texts, ~10M). How can I find the indices of the top-5 data points (texts) that are most similar to a given embedding/data point?
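For reference, this is roughly what I pieced together from the faiss docs (using the embeddings matrix and cls_embedding from the sketches above, and normalizing so that inner product becomes cosine similarity), but I don't know whether an exact flat index is reasonable at 10M vectors or whether I should be using an approximate index like IndexIVFFlat or IndexLSH instead:

import faiss

d = 768
xb = embeddings.astype('float32')   # faiss expects float32
faiss.normalize_L2(xb)              # in-place L2 normalization -> inner product == cosine
index = faiss.IndexFlatIP(d)        # exact inner-product index
index.add(xb)                       # add all N database vectors

query = cls_embedding.detach().numpy().astype('float32').reshape(1, -1)
faiss.normalize_L2(query)
D, I = index.search(query, 5)       # I[0] -> indices of the top-5 most similar texts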