I have a dataframe with a single column of 1000 rows. I need to compare every row with every other row and find the Levenshtein distance for each pair. How do I calculate that ratio or distance in Python?

I have a dataframe as following:

  #Df 
  StepDescription
  click confirm button when done
  you have logged on
  please log in to proceed
  click on confirm button
  Dolb was released successfully
  Enter your details
  validate the statement
  Aval was released sucessfully

How do I calculate the Levenshtein ratio for all of these?

Here is the code I have written to iterate over the rows, but I don't know how to proceed after iterating:

  import Levenshtein
  import pandas as pd
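  # Raw string so the backslash in the path is kept literally;
  # read_csv already returns a DataFrame
  df = pd.read_csv(r'path\Data_TestDescription.csv')
  for index, row in df.iterrows():
      pass  # unsure how to compare this row with the others from here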
cs95
Sayli Jawale
  • Expected output? It seems like you haven't even tried anything. – cs95 Nov 07 '17 at 07:58
  • I need a percentage, i.e. the ratio between each and every pair of rows. I have not tried anything further because I don't know how to compute the distance between the rows after iterating. – Sayli Jawale Nov 07 '17 at 08:02
  • I still want to see some kind of expected output. – cs95 Nov 07 '17 at 08:04
  • For example, I have two strings. String 1: Dolb was released successfully. String 2: Aval was released sucessfully. For these two strings I need the similarity ratio, so my code would be Levenshtein.ratio('Dolb was released successfully','Aval was released sucessfully') and the expected output is 0.8813559322033898. Now I want to do this for all my rows, so how do I iterate and find those distances? – Sayli Jawale Nov 07 '17 at 08:10
  • See if my answer gives you what you want. If not, please critique or request clarification. – cs95 Nov 07 '17 at 08:11
  • A 1k x 1k matrix has 1M values, of which 1k are already known (each string compared with itself gives 0), and half of the rest are duplicates because `dist(A, B) = dist(B, A)`. That leaves 499,500 values to compute. Do you really need to calculate the distance of every possible pair? – Adirio Nov 07 '17 at 08:17
  • @Adirio Yes, for every possible pair. – Sayli Jawale Nov 07 '17 at 09:11

2 Answers

As requested in a comment, a percentage is desired. I'll keep the accepted answer's code and add just the new part:

import numpy as np
import pandas as pd
from Levenshtein import distance
from itertools import product

#df = ...

dist = [distance(*x) for x in product(df.StepDescription, repeat=2)]

dist_df = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]))
dist_df

    0   1   2   3   4   5   6   7
0   0  23  23  13  29  25  25  28
1  23   0  18  18  23  18  18  23
2  23  18   0  20  25  21  19  24
3  13  18  20   0  27  19  21  26
4  29  23  25  27   0  26  23   5
5  25  18  21  19  26   0  19  25
6  25  18  19  21  23  19   0  21
7  28  23  24  26   5  25  21   0

# Multiply before the floor division so that the smallest non-zero distance maps to 100
dist_df_percentage = dist_df * 100 // min(x for x in dist if x > 0)

     0    1    2    3    4    5    6    7
0    0  460  460  260  580  500  500  560
1  460    0  360  360  460  360  360  460
2  460  360    0  400  500  420  380  480
3  260  360  400    0  540  380  420  520
4  580  460  500  540    0  520  460  100
5  500  360  420  380  520    0  380  500
6  500  360  380  420  460  380    0  420
7  560  460  480  520  100  500  420    0
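As the comment about symmetry under the question suggests, `dist(A, B) = dist(B, A)`, so each unordered pair only needs to be computed once. The following is a minimal sketch of that idea, not the code used above. It assumes a pure-Python stand-in for `Levenshtein.distance` so that it runs without extra packages; the names `levenshtein` and `pairwise_distances` are hypothetical.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute.

    Hypothetical pure-Python stand-in for Levenshtein.distance.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def pairwise_distances(strings):
    """Fill a symmetric n x n matrix, computing each unordered pair once."""
    n = len(strings)
    matrix = [[0] * n for _ in range(n)]  # the diagonal stays 0
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = levenshtein(strings[i], strings[j])
    return matrix


# The eight example rows from the question
steps = [
    "click confirm button when done",
    "you have logged on",
    "please log in to proceed",
    "click on confirm button",
    "Dolb was released successfully",
    "Enter your details",
    "validate the statement",
    "Aval was released sucessfully",
]
matrix = pairwise_distances(steps)
print(matrix[4][7])  # 5, the same value as row 4, column 7 in the table above
```

With 1000 rows this computes 499,500 distances instead of 1,000,000. If the `Levenshtein` package is installed, its `distance` function can be passed in place of the stand-in and should be considerably faster.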
Adirio

Finally, after trying lots of examples, I got the exact ratio (as a percentage) using `fuzz.ratio` from fuzzywuzzy:

from itertools import product
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

# df = ...  (the dataframe from the question, with a StepDescription column)

# fuzz.ratio returns an integer score from 0 to 100 for each ordered pair
dist = np.empty(df.shape[0]**2, dtype=int)
for i, x in enumerate(product(df.StepDescription, repeat=2)):
    dist[i] = fuzz.ratio(*x)
dist_df = pd.DataFrame(dist.reshape(-1, df.shape[0]))
dist_df.to_csv('FuzzyRatio.csv', sep='\t')  # to_csv returns None when given a path
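For comparison, the value 0.8813559322033898 quoted in the comments under the question is consistent with a ratio of the form (len(a) + len(b) - d) / (len(a) + len(b)), where d counts only insertions and deletions. That formula is an assumption inferred from this single example, not taken from the library's documentation. A minimal stdlib-only sketch, with the hypothetical helper names `lcs_length` and `indel_ratio`:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def indel_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] based on insert/delete-only edit distance.

    Assumed formula (not confirmed against the Levenshtein package):
    (len(a) + len(b) - indel_distance) / (len(a) + len(b)).
    """
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    indel_distance = total - 2 * lcs_length(a, b)
    return (total - indel_distance) / total


ratio = indel_ratio("Dolb was released successfully",
                    "Aval was released sucessfully")
print(round(ratio, 4))  # 0.8814
```

Multiplying the result by 100 gives a percentage on the same 0 to 100 scale as the scores written to `FuzzyRatio.csv`, although the two measures are not guaranteed to agree on every pair.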
Sayli Jawale