I am exploring sentence transformers and came across this page, which shows how to train on custom data. However, I am not sure how to predict. Given two new sentences, such as 1) "this is the third example" and 2) "this is the example number three", how can I get a prediction of how similar they are?
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
#Define the model. Either from scratch or by loading a pre-trained model
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
#Define your train examples. You need more than just two examples...
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
#Define your train dataset, the dataloader and the train loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
#Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
----------------------------update 1
I updated the code as below
from datetime import datetime  # needed for the timestamp in the save path below
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
#Define the model. Either from scratch or by loading a pre-trained model
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
#Define your train examples. You need more than just two examples...
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
#Define your train dataset, the dataloader and the train loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
The main change compared to the old code is that the model is now saved:
model_save_path2 = '/content/gdrive/MyDrive/folderName1/folderName2/model_try-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
#Tune the model and save it too
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100, output_path=model_save_path2)
I am not sure about the steps below:
#loading the new model
model_new = SentenceTransformer(model_save_path2)  # load from the same path the model was saved to
#predicting
sentences = ["This is an example sentence", "Each sentence is converted"]
model_new.encode(sentences)
Question 1)
Is this the correct approach to get sentence embeddings after fine-tuning the old model and saving it as a new one? I am confused because during fitting we fed in pairs of sentences along with a similarity label, while at prediction time we input one sentence at a time and get a sentence embedding for each.
Question 2)
If I would like a similarity score for two sentences, is the only option to take the sentence embeddings output by this model and then compute cosine similarity?
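To make the second question concrete, here is a minimal sketch of what I mean by computing cosine similarity from two embeddings. The vectors are made-up placeholders standing in for the output of model_new.encode(...), and cosine_similarity is a helper name I chose for illustration, not something from the library. I believe sentence_transformers.util also offers a cos_sim helper for this, but I have not checked it against my installed version.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Placeholder embeddings (assumption): in practice these would come from
# emb_a, emb_b = model_new.encode([sentence_a, sentence_b])
emb_a = [0.2, 0.1, 0.7]
emb_b = [0.25, 0.05, 0.65]

score = cosine_similarity(emb_a, emb_b)
print(f"cosine similarity: {score:.4f}")
```

The score lies between -1 and 1. As far as I understand, this is the same quantity that the labels 0.8 and 0.3 are compared against during training with CosineSimilarityLoss.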