
I have two words and I want to calculate the similarity between them in order to decide whether or not they are duplicates.

How do I achieve that using deep learning / NLP methods?


4 Answers


Here are a few approaches to tackling text similarity:

  • String-based approaches
  • Neural-based approaches
  • Machine-translation-based approaches


But before you consider which library to use to measure similarity, you should try to define what you want to measure when it comes to similarity:

Are you trying to find semantic similarity with syntactic difference?

  • The dog ate the biscuit vs
  • The biscuit was eaten by the dog

Are you trying to find lexical semantic similarity?

  • This problem is driving me mad! vs
  • This problem is making me angry!

Are you trying to find entailment instead of similarity?

  • I ate Chinese food for dinner vs
  • I ate kungpao chicken for dinner

The ambiguity of "similarity" becomes even more complex when comparing individual words without context, e.g.

  • plant vs factory

    • They can be similar, if plant refers to an industrial plant
    • But they are dissimilar if plant refers to the living plant
  • bank vs financial institution

    • They can be similar if bank refers to the place where we deposit or withdraw cash
    • But they are dissimilar if bank refers to a river bank.

There are many other aspects of similarity that one can define, depending on the ultimate task you want to perform with the similarity score.
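
For example, a minimal sketch of the neural-based approach (using the sentence-transformers library shown in the next answer; the model name and example sentences are my own illustrative choices, not from the original answer) makes the plant/factory ambiguity concrete: the same word scores differently against "factory" depending on the surrounding context.

from sentence_transformers import SentenceTransformer, util

# Assumed model choice; any general-purpose sentence-embedding model would do.
model = SentenceTransformer('all-MiniLM-L6-v2')

# The same word "plant" in two different senses, compared against "factory".
industrial = "The plant manufactures car engines."
botanical = "The plant on my windowsill needs watering."
reference = "The factory manufactures car engines."

embeddings = model.encode([industrial, botanical, reference], convert_to_tensor=True)

# Cosine similarity of each "plant" sentence to the "factory" sentence.
print("industrial sense vs factory:", util.cos_sim(embeddings[0], embeddings[2]).item())
print("botanical sense vs factory:", util.cos_sim(embeddings[1], embeddings[2]).item())

The first score should come out noticeably higher than the second, which is exactly the context dependence described above.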

alvas

Here is a copy of the code from the official documentation, as per alvas's links: https://www.sbert.net/docs/usage/semantic_textual_similarity.html

The code in a Google Colab notebook: https://colab.research.google.com/drive/1Ak0xrn3zWf4Rh2YtVo1avGH-EerLhEDe?usp=sharing

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Two lists of words
word1 = ['cat',
         'man',
         'movie',
         'friend']

word2 = ['kitten',
         'boy',
         'film',
         'love']

# Compute embeddings for both lists
embeddings1 = model.encode(word1, convert_to_tensor=True)
embeddings2 = model.encode(word2, convert_to_tensor=True)

# Compute cosine similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)

# Output the pairs with their score
for i in range(len(word1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(word1[i], word2[i], cosine_scores[i][i]))

Using the above code in Colab, I got the following output:

cat          kitten      Score: 0.7882
man          boy         Score: 0.5843
movie        film        Score: 0.8426
friend       love        Score: 0.4168

My conclusion is that, for similarity between words without any context, a threshold score above 0.75 works pretty well; if you provide some context, this model will perform even better.
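
As a minimal sketch of how those scores could drive a duplicate / not-duplicate decision (the 0.75 cut-off comes from the conclusion above; the helper function name is hypothetical, not part of the library):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def is_duplicate(word_a, word_b, threshold=0.75):
    # Flag a pair as duplicates when the cosine similarity of their embeddings
    # exceeds the chosen threshold.
    emb = model.encode([word_a, word_b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

print(is_duplicate("movie", "film"))   # expected True given the scores above
print(is_duplicate("friend", "love"))  # expected False given the scores above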


If it's as simple as getting a score based on the similarity of two words, then I'd suggest using fuzzy matching. Link to reference here: https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe

Make sure you install:

  • fuzzywuzzy
  • python-Levenshtein

from fuzzywuzzy import fuzz

print(fuzz.ratio("hello world!", "Hello worlds"))

>>> 83

If you want a different matching computation, you could follow this doc: https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html

from Levenshtein import jaro

print(jaro("hello world!", "Hello worlds"))

>>> 0.888888888888889
jsn

There are two good ways to calculate the similarity between two words.

  1. You can simply use embedding models like word2vec, GloVe, or fastText (my recommendation), which are all well known and useful. The main objective of an embedding model is to map a word to a vector. Without embeddings, the words in your vocabulary are represented as one-hot vectors: for example, if your vocabulary contains the 4 words ["I", "love", "the", "NLP"], you can represent "NLP" with the vector [0, 0, 0, 1]. An embedding model maps words from that discrete space (the one-hot vectors) to vectors in a continuous space with a different dimensionality (like 100 or 300). These new vectors carry semantic knowledge of the words: vectors of words with similar meanings are close to each other, and vectors of words with different meanings are distant. All three models come pre-trained, so you just need to look up the vectors of your words and use something like cosine similarity to measure how close their meanings are (if you have a particular dataset, it's better to fine-tune the model first); see the sketch at the end of this answer.
  2. You can use more complex networks to find the similarity between two or more sentences. This task is known as Sentence Similarity, and such models are helpful for unsupervised approaches and clustering. They are better used with full sentences as inputs, so the model can see the context and do a better job (context matters because these models are Transformers built on self-attention). You can find these models here, and you would need to pass your words as inputs instead of sentences.

If you only want to find the similarity between words, I recommend the first solution.
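
As a minimal sketch of the first approach, assuming gensim and its pre-trained model downloader are available (the model name and word pairs below are illustrative choices):

import gensim.downloader as api

# Load a small pre-trained GloVe model (an assumed choice; word2vec and fastText
# models are also available through the same downloader).
model = api.load("glove-wiki-gigaword-100")

# Cosine similarity between the two word vectors.
for w1, w2 in [("cat", "kitten"), ("movie", "film"), ("plant", "factory")]:
    print(w1, w2, model.similarity(w1, w2))

Note that words missing from the pre-trained vocabulary will raise a KeyError with word2vec or GloVe vectors; fastText can build vectors for out-of-vocabulary words from character n-grams, which is one reason to prefer it.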