I have the following DataFrame with a single column:
keys
----
'hvo45cj'
'849hydg'
'umh74la'
'glhj5es'
'c8atge'
'trd68b'
...
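For reproducibility, here is a minimal DataFrame mirroring the sample above (the column name `col1` is an assumption chosen to match the loop code below):

```python
import pandas as pd

# Small DataFrame with the sample keys; column named col1
# to match the loop code further down.
df = pd.DataFrame(
    {"col1": ["hvo45cj", "849hydg", "umh74la", "glhj5es", "c8atge", "trd68b"]}
)
```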
I am trying to generate another column/array (same length as the DataFrame) which contains the index numbers of elements from the same column that are nearly similar (fuzzy match). For example:
keys | matches |
---|---|
'hvo45cj' | 0, 199 |
'849hydg' | 1, 78, 89 |
and here is my basic nested loop:
from difflib import SequenceMatcher

def similar(a, b):
    # similarity score between two strings, in [0, 1]
    return SequenceMatcher(None, a, b).ratio()

indices = []
threshold = 0.7
for idx in df.index:
    # loop over rows of the first column
    index = []
    for idxii in df.index:
        # element-wise comparison against every other row
        x = similar(df.col1[idx], df.col1[idxii])
        if x > threshold:
            index.append(idxii)
    indices.append(index)
With this method it takes more than 2 hours to compute; the original data has nearly 160k rows.
Is there a built-in pandas method, or any other way to reduce the time complexity of this task?
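For context, one pruning idea I looked at (a minimal sketch, not a full solution): `difflib` documents `real_quick_ratio()` and `quick_ratio()` as cheap upper bounds on `ratio()`, and `SequenceMatcher` caches information about its second sequence, so reusing one matcher and filtering with the cheap bounds skips most of the expensive calls. The column name `col1` and the helper name `fuzzy_match_indices` are assumptions for illustration:

```python
from difflib import SequenceMatcher

import pandas as pd

def fuzzy_match_indices(keys, threshold=0.7):
    """For each string, return positional indices of near-matches.

    real_quick_ratio() and quick_ratio() are upper bounds on ratio(),
    so the expensive ratio() only runs on pairs that survive both.
    """
    keys = list(keys)
    sm = SequenceMatcher()
    result = []
    for a in keys:
        sm.set_seq2(a)  # seq2 info is cached across comparisons
        hits = []
        for j, b in enumerate(keys):
            sm.set_seq1(b)
            if (sm.real_quick_ratio() > threshold
                    and sm.quick_ratio() > threshold
                    and sm.ratio() > threshold):
                hits.append(j)
        result.append(hits)
    return result

df = pd.DataFrame({"col1": ["hvo45cj", "hvo45cj", "849hydg"]})
df["matches"] = fuzzy_match_indices(df["col1"])
# row 0 matches rows 0 and 1 (identical keys), row 2 only itself
```

This is still O(n²) pairs in the worst case, so it trims the constant factor rather than the complexity.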