
I have a dataset with more than 1500 rows. Each row contains a sentence. I am trying to find the best method to find the most similar sentences among them all. I tried this example, but the processing is so slow that it took around 20 minutes for 1500 rows of data.

I used the code from my previous question and tried many approaches to improve the speed, but none of them helped much. Then I came across the Universal Sentence Encoder in TensorFlow, which seems fast and reasonably accurate. I am working on Colab; you can check it here

import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5", "https://tfhub.dev/google/universal-sentence-encoder-lite/2"]
model = hub.load(module_url)
print ("module %s loaded" % module_url)
def embed(input):
  return model(input)

df = pd.DataFrame(columns=["ID", "DESCRIPTION"], data=[[10, "Cancel ASN WMS Cancel ASN"],
                                                       [11, "MAXPREDO Validation is corect"],
                                                       [12, "Move to QC"],
                                                       [13, "Cancel ASN WMS Cancel ASN"],
                                                       [14, "MAXPREDO Validation is right"],
                                                       [15, "Verify files are sent every hours for this interface from Optima"],
                                                       [16, "MAXPREDO Validation are correct"],
                                                       [17, "Move to QC"],
                                                       [18, "Verify files are not sent"]])

messages = list(df["DESCRIPTION"])
message_embeddings = embed(messages)

for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
  print("Message: {}".format(messages[i]))
  print("Embedding size: {}".format(len(message_embedding)))
  message_embedding_snippet = ", ".join(
      (str(x) for x in message_embedding[:3]))
  print("Embedding: [{}, ...]\n".format(message_embedding_snippet))

What I am looking for

I want an approach where I can pass a threshold, for example 0.90: all rows that are similar to each other with a score above 0.90 should be returned as a result.

Data Sample
ID    |   DESCRIPTION
-----------------------------
10    | Cancel ASN WMS Cancel ASN   
11    | MAXPREDO Validation is corect
12    | Move to QC  
13    | Cancel ASN WMS Cancel ASN   
14    | MAXPREDO Validation is right
15    | Verify files are sent every hours for this interface from Optima
16    | MAXPREDO Validation are correct
17    | Move to QC  
18    | Verify files are not sent 

Expected result

The rows above that are similar with a score of 0.90 or higher should be returned as a result, along with their IDs:

ID    |   DESCRIPTION
-----------------------------
10    | Cancel ASN WMS Cancel ASN
13    | Cancel ASN WMS Cancel ASN
11    | MAXPREDO Validation is corect  # matched even though the spelling is incorrect
14    | MAXPREDO Validation is right
16    | MAXPREDO Validation are correct
12    | Move to QC  
17    | Move to QC 
vivek

1 Answer


There are multiple ways in which you can measure the similarity between two embedding vectors. The most common is cosine similarity.
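For reference, cosine similarity is just the dot product of two vectors divided by the product of their norms. A minimal NumPy sketch (illustrative only; `sklearn` computes the same thing for all pairs at once):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between the two vectors: dot product
    # divided by the product of their magnitudes. Ranges from -1 to 1.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # parallel to a
print(cosine_sim(a, b))  # prints 1.0 (identical direction)
```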

Therefore, the first thing you have to do is calculate the similarity matrix:

Code:

from sklearn.metrics.pairwise import cosine_similarity

message_embeddings = embed(list(df['DESCRIPTION']))
cos_sim = cosine_similarity(message_embeddings)

You get a 9×9 matrix of similarity values. You can create a heatmap of this matrix to visualize it.

Code:

def plot_similarity(labels, corr_matrix):
  sns.set(font_scale=1.2)
  g = sns.heatmap(
      corr_matrix,
      xticklabels=labels,
      yticklabels=labels,
      vmin=0,
      vmax=1,
      cmap="YlOrRd")
  g.set_xticklabels(labels, rotation=90)
  g.set_title("Semantic Textual Similarity")

plot_similarity(list(df['DESCRIPTION']), cos_sim)

Output:

[Heatmap of the similarity matrix]

The darker the cell, the higher the similarity.

And finally, you iterate over the cos_sim matrix to collect all the sentences whose similarity exceeds the threshold:

threshold = 0.8
row_index = []
for i in range(cos_sim.shape[0]):
  if i in row_index:
    continue  # this row was already grouped with an earlier one
  # collect every row whose similarity to row i exceeds the threshold
  # (use >= instead of > if you set the threshold to 1)
  similar = [index for index in range(cos_sim.shape[1]) if cos_sim[i][index] > threshold]
  if len(similar) > 1:  # more than just row i itself
    row_index += similar

sim_df = pd.DataFrame()
sim_df['ID'] = [df['ID'][i] for i in row_index]
sim_df['DESCRIPTION'] = [df['DESCRIPTION'][i] for i in row_index]
sim_df

The resulting data frame looks like this:

[Table of matched rows with ID and DESCRIPTION]

There are different methods you can use to generate the similarity matrix. You can take a look at this for more of them.

Aniket Bote
  • I have gone through the code and executed it. I am wondering why, at a 0.8 threshold, I am getting the terms at index [17] [18] as similar, although they are not that similar. – vivek Sep 10 '20 at 11:30
  • They are not similar; it's simply the next term in the list. – Aniket Bote Sep 10 '20 at 11:53
  • I have updated my answer. Now terms will only be included if they have similar counterparts. – Aniket Bote Sep 10 '20 at 12:48
  • I recommend you ask a new question on SO for the topics mentioned above. – Aniket Bote Sep 10 '20 at 15:03
  • Can you tell me why, when I put the threshold at 1, I still get some rows which are not similar and not even exactly the same? With that threshold we should get only exact matches. – vivek Sep 10 '20 at 17:07
  • When I put the threshold at 1, I get only 4 rows, with IDs 10, 13, 12, 17. Use `>=` when using a threshold of 1. – Aniket Bote Sep 10 '20 at 17:14
  • @vivek Please don't ask new or additional questions in comments, particularly ones which are not *very directly* related to the question or answer. If you need a bit of clarification about an answer, then it's OK to ask for clarification. However, for new questions you should [ask a new question](/questions/ask). When you do, both you and the answerer can get more reputation points from the questions and answer(s). In addition to feeling good about helping people, the primary thing answerers get for putting in time here is reputation points. They don't get those by answering your comments. – Makyen Sep 14 '20 at 21:37