
Suggestions, reference links, and code are appreciated.

I have a dataset with more than 1500 rows, each containing a sentence. I am trying to find the best method to identify the most similar sentences among them.

What I have tried

  1. I tried the K-means algorithm, which groups similar sentences into clusters. The drawback is that I have to pass K to create the clusters, and K is hard to guess. I tried the elbow method to estimate the number of clusters, but that isn't sufficient: this approach groups all of the data, whereas I only want the rows whose similarity to each other is above 0.90, returned with their IDs.

  2. I tried cosine similarity, using TfidfVectorizer to create a matrix which I then passed to cosine similarity. Even this approach didn't work properly.

What I am looking for

I want an approach where I can pass a threshold, for example 0.90, and all rows that are similar to each other above that threshold are returned as a result.

Data Sample
ID    |   DESCRIPTION
-----------------------------
10    | Cancel ASN WMS Cancel ASN   
11    | MAXPREDO Validation is corect
12    | Move to QC  
13    | Cancel ASN WMS Cancel ASN   
14    | MAXPREDO Validation is right
15    | Verify files are sent every hours for this interface from Optima
16    | MAXPREDO Validation are correct
17    | Move to QC  
18    | Verify files are not sent

Expected result

Rows of the above data with similarity of 0.90 or above should be returned as a result, with their IDs:

ID    |   DESCRIPTION
-----------------------------
10    | Cancel ASN WMS Cancel ASN
13    | Cancel ASN WMS Cancel ASN
11    | MAXPREDO Validation is corect  # matches even though the spelling is incorrect
14    | MAXPREDO Validation is right
16    | MAXPREDO Validation are correct
12    | Move to QC  
17    | Move to QC  
vivek

3 Answers


Why did it not work for you with cosine similarity and the TF-IDF vectorizer?

I tried it and it works with this code:

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame(columns=["ID", "DESCRIPTION"], data=[[10, "Cancel ASN WMS Cancel ASN"],
                                                       [11, "MAXPREDO Validation is corect"],
                                                       [12, "Move to QC"],
                                                       [13, "Cancel ASN WMS Cancel ASN"],
                                                       [14, "MAXPREDO Validation is right"],
                                                       [15, "Verify files are sent every hours for this interface from Optima"],
                                                       [16, "MAXPREDO Validation are correct"],
                                                       [17, "Move to QC"],
                                                       [18, "Verify files are not sent"]])

corpus = list(df["DESCRIPTION"].values)

# One TF-IDF vector per sentence.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

threshold = 0.4

# Compare each pair of sentences once (y > x) and print the pairs
# whose cosine similarity exceeds the threshold.
for x in range(X.shape[0]):
    for y in range(x + 1, X.shape[0]):
        similarity = cosine_similarity(X[x], X[y])
        if similarity > threshold:
            print(df["ID"][x], ":", corpus[x])
            print(df["ID"][y], ":", corpus[y])
            print("Cosine similarity:", similarity)
            print()

The threshold can be adjusted as well, but a threshold of 0.9 will not yield the results you want.

The output for a threshold of 0.4 is:

10 : Cancel ASN WMS Cancel ASN
13 : Cancel ASN WMS Cancel ASN
Cosine similarity: [[1.]]

11 : MAXPREDO Validation is corect
14 : MAXPREDO Validation is right
Cosine similarity: [[0.64183024]]

12 : Move to QC
17 : Move to QC
Cosine similarity: [[1.]]

15 : Verify files are sent every hours for this interface from Optima
18 : Verify files are not sent
Cosine similarity: [[0.44897995]]

With a threshold of 0.39, all of your expected sentences appear in the output, but an additional pair with the IDs [15, 18] is found as well:

10 : Cancel ASN WMS Cancel ASN
13 : Cancel ASN WMS Cancel ASN
Cosine similarity: [[1.]]

11 : MAXPREDO Validation is corect
14 : MAXPREDO Validation is right
Cosine similarity: [[0.64183024]]

11 : MAXPREDO Validation is corect
16 : MAXPREDO Validation are correct
Cosine similarity: [[0.39895808]]

12 : Move to QC
17 : Move to QC
Cosine similarity: [[1.]]

14 : MAXPREDO Validation is right
16 : MAXPREDO Validation are correct
Cosine similarity: [[0.39895808]]

15 : Verify files are sent every hours for this interface from Optima
18 : Verify files are not sent
Cosine similarity: [[0.44897995]]
Kim Tang
  • Well, this is quite impressive and works like a charm. I found some challenges, though: 1) When a sentence appears more than twice, the first sentence keeps repeating, e.g. index [11] is repeated for the pairs [11, 14] and [11, 16]. 2) I have more than 1500 rows; when I ran this code for 500 rows, it took about 1 minute to complete. Also, an ID gets repeated when a duplicate sentence is found. – vivek Sep 03 '20 at 09:30
  • I'm glad I could help. If my answer helped you, please mark it as "accepted answer" with the green checkmark. – Kim Tang Sep 03 '20 at 09:32
  • Sure, I will mark this as the answer. But first, can you suggest how I can improve the speed and avoid repeating rows when a duplicate is found? – vivek Sep 03 '20 at 09:34
  • For 1500 rows it is taking over 20 minutes. – vivek Sep 03 '20 at 10:54
  • I am not sure how to improve the speed. I recommend opening a new question focused on the speed, including information about what you have tried, so that others can help you. – Kim Tang Sep 03 '20 at 10:55
  • I have posted another question; please check whether you can answer it. – vivek Sep 10 '20 at 07:39
  • If you have a better approach, please post it as an answer; it would be helpful for me. I have also mentioned the notebook link up there. – vivek Sep 10 '20 at 11:30
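
Regarding the speed concern raised in the comments: the nested loops above call cosine_similarity once per pair at the Python level. A minimal sketch of a vectorized alternative, which computes the full similarity matrix in a single call (same data and threshold as in the answer; variable names are illustrative):

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same data as in the answer above.
df = pd.DataFrame({"ID": [10, 11, 12, 13, 14, 15, 16, 17, 18],
                   "DESCRIPTION": ["Cancel ASN WMS Cancel ASN",
                                   "MAXPREDO Validation is corect",
                                   "Move to QC",
                                   "Cancel ASN WMS Cancel ASN",
                                   "MAXPREDO Validation is right",
                                   "Verify files are sent every hours for this interface from Optima",
                                   "MAXPREDO Validation are correct",
                                   "Move to QC",
                                   "Verify files are not sent"]})

X = TfidfVectorizer().fit_transform(df["DESCRIPTION"])

# One call computes the whole n x n similarity matrix instead of
# one Python-level cosine_similarity call per pair.
sim = cosine_similarity(X)

threshold = 0.4

# Keep only the upper triangle (x < y) so each pair is reported once.
for x, y in zip(*np.where(np.triu(sim > threshold, k=1))):
    print(df["ID"][x], "<->", df["ID"][y], "similarity:", round(sim[x, y], 4))

For 1500 rows this builds a 1500 x 1500 matrix, which easily fits in memory and avoids the per-pair call overhead.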

A possible approach is to use word embeddings to create vector representations of your sentences: take pretrained word embeddings and let an RNN layer combine the word embeddings of each sentence into a single sentence vector. You can then calculate distances between those vectors. Note that you have to decide for yourself which threshold to accept as "similar", since the scales of word embeddings are not fixed.
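
As a rough illustration of the embedding idea (a sketch that mean-pools pretrained GloVe vectors via gensim instead of using an RNN layer; the model name and the pooling choice are assumptions, not the exact setup described above):

import numpy as np
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity

# Assumed pretrained embeddings; any gensim word-embedding model works.
wv = api.load("glove-wiki-gigaword-50")

def sentence_vector(sentence):
    # Mean-pool the vectors of in-vocabulary words as a simple stand-in
    # for an RNN-based sentence representation.
    words = [w for w in sentence.lower().split() if w in wv]
    if not words:
        return np.zeros(wv.vector_size)
    return np.mean([wv[w] for w in words], axis=0)

v1 = sentence_vector("MAXPREDO Validation is corect")
v2 = sentence_vector("MAXPREDO Validation is right")
print(cosine_similarity([v1], [v2])[0][0])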

Update

I did some experiments. In my opinion this is a viable method for such a task; however, you might want to find out for yourself how well it works in your case. I created an example in my git repository.

The Word Mover's Distance algorithm can also be used for this task. You can find more information about this topic in this Medium article.
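
A minimal sketch of Word Mover's Distance with gensim (assuming the same pretrained GloVe vectors; gensim's wmdistance additionally needs an optimal-transport backend such as POT installed):

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")

s1 = "MAXPREDO Validation is corect".lower().split()
s2 = "MAXPREDO Validation is right".lower().split()

# Note: this is a distance, not a similarity, so lower values mean
# more similar sentences and the threshold logic would be inverted.
print(wv.wmdistance(s1, s2))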

MichaelJanz
  • Can you share any snippet for reference? – vivek Sep 03 '20 at 07:20
  • Yes, I will see if I can set up something – MichaelJanz Sep 03 '20 at 07:22
  • Well, I have gone through the example. It looks like you are using an RNN and the code is quite impressive, but I am confused about how to relate it to my work. It would be great if you could relate it to my question further. – vivek Sep 03 '20 at 11:00

One can use this Python 3 library to compute sentence similarity: https://github.com/UKPLab/sentence-transformers

Code example from https://www.sbert.net/docs/usage/semantic_textual_similarity.html:

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-MiniLM-L12-v2')

# Two lists of sentences
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']

sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Compute cosine similarities
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

# Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))

The library contains state-of-the-art sentence embedding models.

See https://stackoverflow.com/a/68728666/395857 to perform sentence clustering.
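
To apply the library to the data in the question, one option is its paraphrase_mining helper, which scores every pair of sentences in a single list. A sketch (the threshold value is illustrative; embedding scores are not on the same scale as TF-IDF scores):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L12-v2')

sentences = ["Cancel ASN WMS Cancel ASN",
             "MAXPREDO Validation is corect",
             "Move to QC",
             "Cancel ASN WMS Cancel ASN",
             "MAXPREDO Validation is right"]

# Scores all sentence pairs and returns (score, index1, index2) triples,
# sorted by decreasing cosine similarity.
pairs = util.paraphrase_mining(model, sentences)

threshold = 0.9
for score, i, j in pairs:
    if score > threshold:
        print(sentences[i], "<->", sentences[j], "Score: {:.4f}".format(score))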

Franck Dernoncourt