I have two words and I want to calculate the similarity between them in order to rank them if they are duplicates or not.
How do I achieve that using deep learning / NLP methods?
Here are a few approaches to tackle text similarity. But before you consider which library to use to measure similarity, you should try to define what you want to measure when it comes to "similarity", e.g.

- "The dog ate the biscuit" vs. "The biscuit was eaten by the dog"
- "This problem is driving me mad!" vs. "This problem is making me angry!"
- "I ate Chinese food for dinner" vs. "I ate kungpao chicken for dinner"

The ambiguity of "similarity" becomes even more complex when comparing individual words without context, e.g.

- "plant" vs. "factory", where "plant" refers to the industrial plant
- "plant" vs. "factory", where "plant" refers to the living thing (plant)
- "bank" vs. "financial institute", where "bank" refers to the place we deposit or withdraw cash
- "bank" vs. "financial institute", where "bank" refers to the river bank

There are many other aspects of similarity that one can define, depending on the ultimate task that you want to do with the similarity score.
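As a rough sketch (not part of the original answer), here is how two of these notions can give different results on the same pair of sentences. It assumes sentence-transformers is installed and uses the all-MiniLM-L6-v2 model as an arbitrary choice, contrasting simple word overlap with embedding-based cosine similarity:

from sentence_transformers import SentenceTransformer, util

def word_overlap(a, b):
    # Jaccard overlap of lowercased tokens: only shared surface words count
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

s1 = "The dog ate the biscuit"
s2 = "The biscuit was eaten by the dog"

model = SentenceTransformer('all-MiniLM-L6-v2')
emb = model.encode([s1, s2], convert_to_tensor=True)

print("Word overlap:", word_overlap(s1, s2))
print("Embedding cosine:", util.cos_sim(emb[0], emb[1]).item())

The two scores answer different questions: the overlap counts shared words, while the embedding score reflects how close the sentences are in meaning.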
Here is a copy of the code from the official documentation, as per Alva's links: https://www.sbert.net/docs/usage/semantic_textual_similarity.html
The code in Google Colab: https://colab.research.google.com/drive/1Ak0xrn3zWf4Rh2YtVo1avGH-EerLhEDe?usp=sharing
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
# Two lists of words
word1 = ['cat',
'man',
'movie',
'friend']
word2 = ['kitten',
'boy',
'film',
'love']
#Compute embedding for both lists
embeddings1 = model.encode(word1, convert_to_tensor=True)
embeddings2 = model.encode(word2, convert_to_tensor=True)
#Compute cosine-similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)
#Output the pairs with their score
for i in range(len(word1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(word1[i], word2[i], cosine_scores[i][i]))
Using the above code in Colab, I got the following output:
cat kitten Score: 0.7882
man boy Score: 0.5843
movie film Score: 0.8426
friend love Score: 0.4168
My conclusion is that for similarity between words without any surrounding context, a threshold score above 0.75 works pretty well; if you provide some context, this model will perform even better.
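A small sketch (not from the answer itself) of how such a threshold could be applied to flag duplicates; the helper name is illustrative and the 0.75 cutoff is taken from the conclusion above:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def is_duplicate(w1, w2, threshold=0.75):
    # Encode both words and compare them with cosine similarity
    emb = model.encode([w1, w2], convert_to_tensor=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    return score >= threshold, score

print(is_duplicate("movie", "film"))   # expected to clear the threshold
print(is_duplicate("friend", "love"))  # expected to fall below it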
If it's as simple as getting a score based on the similarity of two words, then I'd suggest using fuzzy matching. Link to reference here: https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe
Make sure you install the package first (pip install fuzzywuzzy):
from fuzzywuzzy import fuzz
print(fuzz.ratio("hello world!", "Hello worlds"))
>>> 83
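As a quick illustrative aside (not part of the original answer), fuzzy matching scores character-level similarity rather than meaning, so spelling drives the score; the word pairs below are arbitrary examples:

from fuzzywuzzy import fuzz

# Character-level similarity: near-identical spellings score high,
# synonyms with different spellings score low.
print(fuzz.ratio("plant", "plane"))   # high: four of five characters match
print(fuzz.ratio("movie", "film"))    # low: hardly any characters match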
If you want a different matching computation, you could follow this doc: https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html
from Levenshtein import jaro
print(jaro("hello world!", "Hello worlds"))
>>> 0.888888888888889
There are two good ways to calculate the similarity between two words. In a one-hot representation, each word in the vocabulary gets its own dimension; for example, with the vocabulary ["I", "love", "the", "NLP"], you can represent "NLP" with a vector like [0, 0, 0, 1]. With an embedding model, you can map words from this discrete space (the one-hot vector) to vectors in a continuous space with a different dimensionality (like 100 or 300). These new vectors carry semantic knowledge of the words in themselves: the vectors of words with similar meanings are close to each other, and the vectors of words with different meanings are distant. These three models are pre-trained, and you need to get the vectors of your words and use something like cosine similarity to find how close their semantics are to each other (if you have a particular dataset, it's better to fine-tune your model first). If you only want to find the similarity between words, I recommend the first solution.
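A minimal sketch of this embedding-based approach, assuming gensim and one of its downloadable pre-trained models (the choice of "glove-wiki-gigaword-100" is mine, not from the answer):

import gensim.downloader as api

# Load pre-trained word vectors (downloaded on first use) and compare
# two words with cosine similarity; higher means closer in meaning.
wv = api.load("glove-wiki-gigaword-100")

print(wv.similarity("cat", "kitten"))   # related words: relatively high score
print(wv.similarity("cat", "factory"))  # unrelated words: lower score

Note that words outside the model's vocabulary will raise a KeyError with this kind of model.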