
I want to generate a string distance matrix using Python, as shown below.

          str1    str2    str3    str4    ...     str4k
  str1    0.8     0.4     0.6     0.1     ...     0.2
  str2    0.4     0.7     0.5     0.1     ...     0.1
  str3    0.6     0.5     0.6     0.1     ...     0.1
  str4    0.1     0.1     0.1     0.5     ...     0.6
  .       .       .       .       .       ...     .
  .       .       .       .       .       ...     .
  .       .       .       .       .       ...     .
  str20k  0.2     0.1     0.1     0.6     ...     0.7

I have two CSV files: crnt2 has 4K rows and hist2 has 20K rows. I'm using the code below to generate the matrix.

import textdistance
import numpy as np
import csv

def read_csv_data(fileName):
    # Read every row of the CSV file into a list of lists.
    with open(fileName, "r", newline="") as f:
        return list(csv.reader(f))

c = read_csv_data("crnt2.csv")
h = read_csv_data("hist2.csv")

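# One row per hist2 string, one column per crnt2 string.
m = np.zeros((len(h), len(c)), dtype=int)

for i in range(len(h)):
    for j in range(len(c)):
        # The first CSV column holds the string to compare.
        m[i, j] = textdistance.levenshtein.distance(h[i][0], c[j][0])

np.savetxt("output.csv", m, delimiter=",")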

When I run my Python code, it takes around 30 seconds to process one row, which works out to about 166 hours for the complete output.
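As a rough check of those figures, here is the arithmetic spelled out. The row counts are the approximate "4K" and "20K" from above, and the per-pair cost is derived from the 30-second figure rather than measured.

```python
# Figures taken from the question; the per-pair cost is derived, not measured.
rows_hist = 20_000      # rows in hist2.csv (approximate, "20K")
rows_crnt = 4_000       # rows in crnt2.csv (approximate, "4K")
seconds_per_row = 30    # observed time for one full pass of the inner loop

total_pairs = rows_hist * rows_crnt
total_hours = rows_hist * seconds_per_row / 3600
ms_per_pair = seconds_per_row * 1000 / rows_crnt

print(f"{total_pairs:,} string pairs")    # 80,000,000 string pairs
print(f"{total_hours:.1f} hours total")   # 166.7 hours total
print(f"{ms_per_pair:.1f} ms per pair")   # 7.5 ms per pair
```

So the loop itself is not doing anything unusual; each single distance call costs several milliseconds, and there are 80 million of them.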

When I use R on the same dataset with stringdistmatrix, it takes barely 2 to 3 minutes to produce the same output.

> library(stringdist)
> a <- read.csv("crnt2.csv")
> b <- read.csv("hist2.csv")
> c <- stringdistmatrix(a$column1,b$column1, method = c("jw"))
> write.csv(c,file = "output.csv")
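Note that the R call uses method "jw", while the Python code computes Levenshtein, so the two runs are not measuring the same thing. As a minimal sketch of the difference, here is a pure-Python Jaro distance. My understanding is that this is what stringdist returns for "jw" with its default p = 0, but treat that mapping as an assumption to check against the stringdist documentation. The function name jaro_distance is hypothetical and exists only for this illustration.

```python
def jaro_distance(s, t):
    """Return 1 - Jaro similarity, a float in [0, 1]."""
    if s == t:
        return 0.0
    if not s or not t:
        return 1.0
    # Characters count as matching only within this sliding window.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 1.0
    # Matched characters that appear in a different order, counted in pairs.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) / 2
    similarity = (matches / len(s) + matches / len(t)
                  + (matches - transpositions) / matches) / 3
    return 1.0 - similarity


d = jaro_distance("MARTHA", "MARHTA")
print(round(d, 4))   # 0.0556
print(int(d))        # 0, which is what an integer matrix would store
```

For the same pair the Levenshtein distance is 2, so the two metrics are on different scales. If the Python side is switched to a 0-to-1 metric like this one, the matrix created with dtype=int would also have to become a float array, otherwise almost every entry is truncated to 0.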

The catch is that I have to use Python for this and can't use R, so please guide me on how to reduce the run time in Python.

Thanks in advance.

  • Does this help: https://stackoverflow.com/questions/37428973/string-distance-matrix-in-python – Mark Jul 14 '20 at 14:31
  • Thank you Mark for the suggestion. I followed the same solution but I'm still facing the same issue; the time taken by the Python code is still the same. – Bhoopesh Sharma Jul 14 '20 at 14:34
  • I think reading with the csv module is what takes the time. Try using pandas to read and write the CSV and let me know if there is any improvement. – Dinesh Jul 14 '20 at 14:38
  • The csv module doesn't take much time; it takes barely 20 seconds to read both files. The time goes into the for loop that calculates the distances; each iteration takes around 30 seconds to process. – Bhoopesh Sharma Jul 14 '20 at 14:49
  • What metric do you want to use? The R metric is not the same as the Python metric ("lv" and "jw" in R are different). Also, if you only want the closest K for, say, K = 1..5, you'd be better off using a ball-tree method to search. Self-implemented loops are the naive approach; one speedup could be to vectorize one dimension. – Willem Hendriks Jul 15 '20 at 07:10
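Following up on the last comment, one way to push the whole pairwise computation out of the Python loop is to hand both lists to a compiled library in a single call. The sketch below assumes the third-party packages rapidfuzz and numpy are installed and uses rapidfuzz's process.cdist with its Levenshtein scorer; if they are missing, it falls back to a slow pure-Python implementation so the example still runs. The helper names levenshtein and distance_matrix are hypothetical, and I have not timed this on the 20K by 4K dataset.

```python
def levenshtein(a, b):
    """Pure-Python Levenshtein distance using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner row short
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def distance_matrix(rows, cols):
    """Return a len(rows) x len(cols) list of Levenshtein distances."""
    try:
        # Assumption: rapidfuzz (which needs numpy for cdist) is installed.
        import numpy  # noqa: F401
        from rapidfuzz.distance import Levenshtein
        from rapidfuzz.process import cdist
    except ImportError:
        # Fallback: same result, computed pair by pair in Python.
        return [[levenshtein(r, c) for c in cols] for r in rows]
    # workers=-1 asks rapidfuzz to use all available CPU cores.
    return cdist(rows, cols, scorer=Levenshtein.distance, workers=-1).tolist()


hist = ["kitten", "flaw"]
crnt = ["sitting", "lawn"]
print(distance_matrix(hist, crnt))   # [[3, 5], [7, 2]]
```

With the data from the question, the inputs would be the first column of each file, for example [r[0] for r in h] and [r[0] for r in c].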
