I want to generate a string distance matrix using Python, as below.
          str1  str2  str3  str4  ...  str4k
str1      0.8   0.4   0.6   0.1   ...  0.2
str2      0.4   0.7   0.5   0.1   ...  0.1
str3      0.6   0.5   0.6   0.1   ...  0.1
str4      0.1   0.1   0.1   0.5   ...  0.6
.         .     .     .     .     ...  .
.         .     .     .     .     ...  .
.         .     .     .     .     ...  .
str20k    0.2   0.1   0.1   0.6   ...  0.7
I have two CSV files: crnt2 has 4K rows and hist2 has 20K rows. I'm using the code below to generate the matrix.
import textdistance
import numpy as np
import csv

def read_csv_data(fileName):
    # Read every row of the CSV file into a list of lists
    lst_temp = []
    with open(fileName, "r") as f:
        rdr = csv.reader(f)
        for r in rdr:
            lst_temp.append(r)
    return lst_temp

c = read_csv_data("crnt2.csv")
h = read_csv_data("hist2.csv")

# One row per hist2 string, one column per crnt2 string
m = np.zeros((len(h), len(c)), dtype=int)
for i in range(0, len(h)):
    for j in range(0, len(c)):
        # Compare the first column of each row
        m[i][j] = textdistance.levenshtein.distance(h[i][0], c[j][0])

np.savetxt("output.csv", m, delimiter=",")
When I run the Python code, it takes around 30 seconds to process one row, which works out to about 166 hours to produce the complete output.
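To show where the 166 hours comes from, here is a quick sketch of the arithmetic. It assumes the roughly 30 seconds per row I measured stays constant across all 20K rows of hist2; the variable names are only for illustration and are not part of my actual script.

```python
# Back-of-the-envelope check of the runtime estimate.
# Assumption: every hist2 row takes about as long as the rows timed so far.
rows_hist = 20_000      # rows in hist2.csv (outer loop)
rows_crnt = 4_000       # rows in crnt2.csv (inner loop)
seconds_per_row = 30    # observed time for one outer-loop row

total_hours = rows_hist * seconds_per_row / 3600
pairs = rows_hist * rows_crnt
ms_per_pair = seconds_per_row / rows_crnt * 1000

print(f"{total_hours:.1f} hours for {pairs:,} comparisons")
print(f"{ms_per_pair:.1f} ms per comparison")
```

So the full run is about 80 million string comparisons at roughly 7.5 ms each, which is where nearly all of the time goes.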
When I use stringdistmatrix in R on the same dataset, it takes only 2 to 3 minutes to produce the same output.
> a <- read.csv("crnt2.csv")
> b <- read.csv("hist2.csv")
> c <- stringdistmatrix(a$column1,b$column1, method = c("jw"))
> write.csv(c,file = "output.csv")
The catch is that I have to use Python for this and can't use R, so please guide me on how to reduce the run time in Python.
Thanks in advance.