I have a list A as below.

A = ['vikash','vikas','Vinod','Vikky','Akash','Vinodh','Sachin','Salman','Ajay','Suchin','Akash','vikahs']

I want to compare each element of the list with every other element, find the strings that fuzzy-match it with a matching ratio of 90% or above, and count those matching elements.

The result should be a data frame like the one below.

string   Matching strings   count
=================================
Vikash   vikas,vikahs       2
vikas    vikash,vikahs      2
vinod    vinodh             1
Vikky                       0
Akash    Akash              1
...
..
Vikahs   vikash,vikas       2

Could anyone help me achieve this? I am new to Python.

Thanks

cubick
  • What do you mean by a matching ratio of 90% or above? There are different kinds of matching ratio, e.g. in fuzzywuzzy. – maxbachmann Apr 08 '20 at 11:24
  • When you have two elements with a score over 90%, should they both be in the result and incremented by one, or just one of them? – maxbachmann Apr 08 '20 at 11:30
  • I am using ratio to check the fuzzy matching ratio: from fuzzywuzzy import fuzz, then fuzz.ratio(str1, str2). If more than one is above 90%, then both of them should be in the result and incremented. – Vikash Chauradia Apr 08 '20 at 12:16
  • Thanks @maxbachmann for your quick reply, I really appreciate your time and expert advice. – Vikash Chauradia Apr 08 '20 at 12:18

1 Answer

This can be implemented with FuzzyWuzzy as follows:

import pandas as pd
from fuzzywuzzy import fuzz

elements = ['vikash', 'vikas', 'Vinod', 'Vikky', 'Akash', 'Vinodh', 'Sachin', 'Salman', 'Ajay', 'Suchin', 'Akash', 'vikahs']

# One row per element: [name, matched strings, match count]
results = [[name, [], 0] for name in elements]

# Compare each pair once and record a match on both sides of the pair
for i, element in enumerate(elements):
    for j, choice in enumerate(elements[i + 1:], start=i + 1):
        if fuzz.ratio(element, choice) >= 90:
            results[i][2] += 1
            results[i][1].append(choice)
            results[j][2] += 1
            results[j][1].append(element)

data = pd.DataFrame(results, columns=['name', 'duplicates', 'duplicate_count'])

As an alternative, I wrote the library RapidFuzz, which is faster while returning the same results as FuzzyWuzzy. It can be used as follows:

import pandas as pd
from rapidfuzz import fuzz

elements = ['vikash', 'vikas', 'Vinod', 'Vikky', 'Akash', 'Vinodh', 'Sachin', 'Salman', 'Ajay', 'Suchin', 'Akash', 'vikahs']

# One row per element: [name, matched strings, match count]
results = [[name, [], 0] for name in elements]

# Compare each pair once and record a match on both sides of the pair
for i, element in enumerate(elements):
    for j, choice in enumerate(elements[i + 1:], start=i + 1):
        # With score_cutoff=90, ratio returns 0 (falsy) for scores below 90
        if fuzz.ratio(element, choice, score_cutoff=90):
            results[i][2] += 1
            results[i][1].append(choice)
            results[j][2] += 1
            results[j][1].append(element)

data = pd.DataFrame(results, columns=['name', 'duplicates', 'duplicate_count'])

I ran a quick benchmark to show the runtime difference between the two, with 1000 runs each:

# FuzzyWuzzy
0.13835792080499232

# RapidFuzz
0.03843669104389846
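
The benchmark script itself is not shown here. A minimal sketch of how such a timing could be taken with the standard library's timeit follows. The helpers count_matches and time_scorer are hypothetical names, and difflib is used as a stand-in scorer (an assumption) so the block runs without third-party packages; to time the real libraries, pass fuzz.ratio from fuzzywuzzy or rapidfuzz as the scorer.

```python
import timeit
from difflib import SequenceMatcher

elements = ['vikash', 'vikas', 'Vinod', 'Vikky', 'Akash', 'Vinodh',
            'Sachin', 'Salman', 'Ajay', 'Suchin', 'Akash', 'vikahs']


def difflib_ratio(a, b):
    """Stand-in scorer (assumption): difflib similarity scaled to 0-100."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def count_matches(names, scorer, threshold=90):
    """Hypothetical helper: run the pairwise loop, return the number of matching pairs."""
    matches = 0
    for i, element in enumerate(names):
        for choice in names[i + 1:]:
            if scorer(element, choice) >= threshold:
                matches += 1
    return matches


def time_scorer(scorer, runs=1000):
    """Total time in seconds for `runs` full passes over the list."""
    return timeit.timeit(lambda: count_matches(elements, scorer), number=runs)


# To time a real library instead, pass its scorer, for example:
#   from rapidfuzz import fuzz
#   time_scorer(fuzz.ratio)
elapsed = time_scorer(difflib_ratio, runs=100)
print(f"difflib stand-in, 100 runs: {elapsed:.4f} s")
```

Absolute numbers depend on the machine, so only the relative difference between scorers is meaningful.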

Both produce the following output:

      name        duplicates  duplicate_count
0   vikash           [vikas]                1
1    vikas  [vikash, vikahs]                2
2    Vinod          [Vinodh]                1
3    Vikky                []                0
4    Akash           [Akash]                1
5   Vinodh           [Vinod]                1
6   Sachin                []                0
7   Salman                []                0
8     Ajay                []                0
9   Suchin                []                0
10   Akash           [Akash]                1
11  vikahs           [vikas]                1
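
Note that vikash and vikahs do not pair up in this output, although the question expected them to. As far as I understand it, the ratio is 2 * M / T * 100, where M is the number of matched characters and T is the combined length of both strings. The swapped "sh"/"hs" costs one matched character, so the pair scores 2 * 5 / 12, about 83. The sketch below reproduces these scores with the standard library's difflib, used here as a stand-in (FuzzyWuzzy's pure-Python fallback is built on it; RapidFuzz has its own implementation, so treat the equivalence as an assumption):

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity on a 0-100 scale: 2 * matched characters / combined length."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


for a, b in [('vikash', 'vikas'), ('vikas', 'vikahs'), ('vikash', 'vikahs')]:
    print(a, b, ratio(a, b))
# vikash vikas 91
# vikas vikahs 91
# vikash vikahs 83
```

Lowering the threshold to 83 would pair vikash with vikahs, but it would also pair Sachin with Suchin, which scores 83 as well.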
maxbachmann
  • After installing the rapidfuzz library I get the error below at the line "from rapidfuzz import fuzz" (raised in C:\ProgramData\Anaconda3\lib\site-packages\rapidfuzz\__init__.py): ImportError: DLL load failed: The specified module could not be found. – Vikash Chauradia Apr 08 '20 at 14:25
  • Can you open an issue here: https://github.com/maxbachmann/rapidfuzz/issues – maxbachmann Apr 08 '20 at 14:29
  • Also, with the above function I am able to get the count of matched strings, but I need the list of matched strings as well. – Vikash Chauradia Apr 08 '20 at 14:41
  • vikash 1 vikas 2 Vinod 1 Vikky 0 Akash 1 Vinodh 1 Sachin 0 Salman 0 Ajay 0 Suchin 0 Akash 1 vikahs 1 How can I get the matching strings along with this result? In this result I am only getting the string and the count of matched strings. – Vikash Chauradia Apr 08 '20 at 14:44
  • Also, could you please help me fix the "DLL load failed" error of rapidfuzz? – Vikash Chauradia Apr 08 '20 at 14:45
  • Well it is a list of the result and a list of the elements in a series. I am not really working a lot with pandas, so I am not quite sure in which form you would like the results. Your text of how you would like to have the results is simply a string without any structure that would tell me where to put it in – maxbachmann Apr 08 '20 at 15:02
  • For the dll error it would be good if you could open an issue here: https://github.com/maxbachmann/rapidfuzz/issues, so I can try this out. As a first idea you might need to install https://visualstudio.microsoft.com/visual-cpp-build-tools since it is using C++. – maxbachmann Apr 08 '20 at 15:04
  • I want the output with 3 columns in a series: 1. the string itself, 2. the matching strings (comma separated) and 3. the count of matching strings. – Vikash Chauradia Apr 08 '20 at 15:49
  • Thanks a ton @maxbachmann. You are really a savior for me. My query is answered now. – Vikash Chauradia Apr 09 '20 at 06:40
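
For the three-column layout requested in the comments (string, comma-separated matches, count), the list of matches can be joined with ','.join. A minimal sketch follows; match_table is a hypothetical helper name, and difflib is used as a stand-in scorer (an assumption) so the block runs without extra packages. Pass fuzz.ratio from fuzzywuzzy or rapidfuzz as the scorer to use those libraries instead.

```python
from difflib import SequenceMatcher


def difflib_ratio(a, b):
    """Stand-in scorer (assumption): difflib similarity scaled to 0-100."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def match_table(names, scorer=difflib_ratio, threshold=90):
    """Hypothetical helper: return (string, comma-separated matches, count) rows."""
    matches = [[] for _ in names]
    for i, element in enumerate(names):
        for j in range(i + 1, len(names)):
            if scorer(element, names[j]) >= threshold:
                # Record the match on both sides of the pair
                matches[i].append(names[j])
                matches[j].append(element)
    return [(name, ','.join(found), len(found))
            for name, found in zip(names, matches)]


elements = ['vikash', 'vikas', 'Vinod', 'Vikky', 'Akash', 'Vinodh',
            'Sachin', 'Salman', 'Ajay', 'Suchin', 'Akash', 'vikahs']

rows = match_table(elements)
for row in rows:
    print(row)
```

If pandas is available, pd.DataFrame(rows, columns=['string', 'matching_strings', 'count']) turns these rows into a data frame.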