
I have the following DataFrame with a single column:

 keys
 ----  
'hvo45cj'        
'849hydg'         
'umh74la'         
'glhj5es'        
'c8atge'      
'trd68b'        
...

I am trying to generate another column/array (the same length as df) that contains, for each key, the index numbers of the elements in the same column that are nearly similar (a lazy match). For example:

keys matches
'hvo45cj' 0, 199
'849hydg' 1, 78, 89

I have this basic nested loop:

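from difflib import SequenceMatcher

def similar(a, b):
    # difflib similarity ratio, a float in [0, 1]
    return SequenceMatcher(None, a, b).ratio()

indices = []
threshold = .7
for idx in df.index:
    # outer loop: the reference key
    index = []
    for idxii in df.index:
        # inner loop: element-wise comparison against every key
        # (column name taken from the table above; bracket access is
        # needed because df.keys is a DataFrame method)
        x = similar(df["keys"][idx], df["keys"][idxii])

        # keep the index of every key scoring above the threshold
        if x > threshold:
            index.append(idxii)

    indices.append(index)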

With this method it takes more than 2 hours to compute; the original data has nearly 160k rows.

Is there a built-in pandas method, or any other way, to reduce the time complexity of this task?

chronus
  • Do you need to compare the same indices twice? For example index 1 and 2 ... then 2 and 1? Also, consider the [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) module. – Andrej Kesely Apr 29 '21 at 18:53
  • Would be helpful to know how the similarity score is calculated. – Ghoti Apr 29 '21 at 18:53
  • Just an idea: maybe you could one-hot encode all keys and then perform a DiceSimilarity match (RDKIT library). I use it to search for similar chemical structures, which are also one-hot encoded (it's quite fast). – John Mommers Apr 29 '21 at 19:00
  • Please supply the expected [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) (MRE). We should be able to copy and paste a contiguous block of your code, execute that file, and reproduce your problem along with tracing output for the problem points. This lets us test our suggestions against your test data and desired output. [Include a minimal data frame](https://stackoverflow.com/questions/52413246/how-to-provide-a-reproducible-copy-of-your-dataframe-with-to-clipboard) as part of your MRE. – Prune Apr 29 '21 at 19:04
  • Thank you for the replies. Comparing the same indices twice is not necessary; I will implement multiprocessing and @john-mommers's one-hot advice. – chronus Apr 29 '21 at 19:12
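The suggestion in the comments to skip repeated comparisons can be sketched as follows. This is a minimal illustration, not a tested solution for the 160k-row data: the function name `similar_index_lists` is hypothetical, and it assumes that scoring each unordered pair once is acceptable (difflib documents that `ratio()` can depend on argument order, so results may differ slightly from the full double loop). It also uses `quick_ratio()` and `real_quick_ratio()`, which difflib documents as upper bounds on `ratio()`, to discard hopeless pairs cheaply. The work is still quadratic in the number of keys; only the constant factor shrinks.

```python
from difflib import SequenceMatcher


def similar_index_lists(keys, threshold=0.7):
    """For each key, list the positions of keys whose ratio exceeds threshold.

    Hypothetical helper: each unordered pair is scored once and the hit is
    recorded for both members. Assumes threshold < 1, so every key matches
    itself (ratio 1.0), as in the original double loop.
    """
    keys = list(keys)
    matches = [[i] for i in range(len(keys))]  # self-match for every key
    matcher = SequenceMatcher(None)
    for j, key_j in enumerate(keys):
        # difflib caches its analysis of the second sequence, so set it once
        # and vary only the first sequence in the inner loop.
        matcher.set_seq2(key_j)
        for i in range(j):
            matcher.set_seq1(keys[i])
            # The two cheap checks are upper bounds on ratio(): a pair that
            # fails either one cannot pass the expensive check.
            if (matcher.real_quick_ratio() > threshold
                    and matcher.quick_ratio() > threshold
                    and matcher.ratio() > threshold):
                matches[i].append(j)
                matches[j].append(i)
    return [sorted(m) for m in matches]
```

With a default integer index, something like `df["matches"] = similar_index_lists(df["keys"])` would attach the result as a column. The returned numbers are positions, so they coincide with index labels only when the index is the default 0..n-1 range.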

0 Answers