
I am working on a project that involves calculating sentence similarity. A context vector for each token in a sentence is generated using Hugging Face's BERT. The code below returns all the token vectors for a sentence.

sentence= "Hello this is a test sentence."
tokens = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True, max_length=512)
output= model(tokens['input_ids'], attention_mask=tokens['attention_mask'])
token_vectors = output.hidden_states[-1].detach()
return token_vectors[0]
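
Called on two sentences, this returns differently sized matrices of token vectors (the second sentence and the exact shapes below are just illustrative):

vecs_a = get_token_vectors("Hello this is a test sentence.")
vecs_b = get_token_vectors("Another, somewhat longer example sentence.")
print(vecs_a.shape, vecs_b.shape)  # e.g. torch.Size([9, 768]) torch.Size([11, 768])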

Upon researching various metrics, I came across Earth Mover's Distance (also known as Wasserstein Distance), which can serve as a dissimilarity measure. I did not opt for cosine similarity because two sentences generally produce different numbers of token vectors (sentences are rarely the same length), and each token is itself represented by a multi-dimensional context vector, so there is no single pair of equal-length vectors to compare.

I found SciPy's implementation (scipy.stats.wasserstein_distance), but it only handles 1-D distributions. I also found a Stack Overflow question that seemed promising, until I realized that, despite catering for multi-dimensional vectors, it requires the two sets of vectors to be of equal length.
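
For reference, a minimal sketch of the SciPy function I mean; it accepts 1-D samples (even of different sizes), but there is no obvious way to feed it (seq_len, 768) matrices of token vectors:

from scipy.stats import wasserstein_distance

# works: two 1-D empirical distributions, even with different sample counts
print(wasserstein_distance([0.0, 1.0, 3.0], [5.0, 6.0, 8.0, 9.0]))
# no way to pass multi-dimensional token vectors here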

Does anyone have suggestions on how I might implement this? Or, alternatively, a technique other than Earth Mover's Distance that could be used?
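
For what it is worth, the closest I have gotten is the idea of building a pairwise cost matrix between the two sets of token vectors and solving the transport problem with the POT library (ot.dist and ot.emd2 are from POT's API; the uniform per-token weights are my own assumption), though I am not sure this is the right approach:

import numpy as np
import ot  # POT: Python Optimal Transport

def sentence_emd(vecs_a, vecs_b):
    # vecs_a: (n, 768) and vecs_b: (m, 768) token-vector arrays; n and m may differ
    a = np.full(len(vecs_a), 1.0 / len(vecs_a))  # assumption: uniform weight per token
    b = np.full(len(vecs_b), 1.0 / len(vecs_b))
    M = ot.dist(vecs_a, vecs_b, metric='euclidean')  # (n, m) pairwise cost matrix
    return ot.emd2(a, b, M)  # optimal transport cost, i.e. Earth Mover's Distance

print(sentence_emd(vecs_a.numpy(), vecs_b.numpy()))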

Thanks in advance!
