1

Say I have a pandas dataframe that looks like this:

ID    String1                         String2
1     The big black wolf              The small wolf
2     Close the door on way out       door the Close
3     where's the money               where is the money
4     123 further out                 out further

I want to cross tab each row in columns String1 and String2, before doing a fuzzy string matching, similar to Python fuzzy string matching as correlation style table/matrix.

My challenge is that the solution in the link I posted only works when the number of words in String1 and String2 are same. Secondly that solution looks at all the rows in the column while I want mine to only do a row by row comparison.

Proposed solution should do a matrix like comparison for row 1 like:

       string1     The  big  black  wolf  Maximum
       string2
       The          100  0    0      0     100
       small        0    0    0      0     0
       wolf         0    0    0      100   100
ID    String1                         String2               Matching_Average
1     The big black wolf              The small wolf        66.67
2     Close the door on way out       door the Close
3     where's the money               where is the money
4     123 further out                 out further

where matching average is the sum of 'maximum' column divided by the number of words in String2

Mikee
  • 783
  • 1
  • 6
  • 18
  • *where matching average is the sum of 'maximum' column divided by the number of words in String1* - you meant String2 and not String1? – anky Apr 28 '21 at 14:57
  • That is correct @anky, will edit now. – Mikee Apr 28 '21 at 14:58

1 Answers1

1

You can first get dummies from the 2 series, then get the intersection of columns, sum them up and divide by the second column's dummies:

a = df['String1'].str.get_dummies(' ')
b = df['String2'].str.get_dummies(' ')
u = b[b.columns.intersection(a.columns)]
df['Matching_Average'] = u.sum(1).div(b.sum(1)).mul(100).round(2)

print(df)

   ID                    String1             String2  Matching_Average
0   1         The big black wolf      The small wolf             66.67
1   2  Close the door on way out      door the Close            100.00
2   3          where's the money  where is the money             50.00
3   4            123 further out         out further            100.00

Otherwise if you are okay for a string matching algorithm, you can use difflib:

from difflib import SequenceMatcher
[SequenceMatcher(None,x,y).ratio() for x,y in zip(df['String1'],df['String2'])]
#[0.625, 0.2564102564102564, 0.9142857142857143, 0.6153846153846154]
anky
  • 74,114
  • 11
  • 41
  • 70
  • Thanks @anky, what if I want to use the ```from fuzzywuzzy import fuzz``` to fuzzy match the strings? – Mikee Apr 28 '21 at 16:12
  • Since you just want to get ratio after comparing you want `[fuzz.ratio(x,y) for x,y in zip(df['String1'],df['String2'])]` after importing , you may also try `fuzz.partial_ratio` depending on your [requirement(check this)](https://stackoverflow.com/questions/31806695/when-to-use-which-fuzz-function-to-compare-2-strings) – anky Apr 28 '21 at 16:12
  • Took me a while to figure it out but I think your algorithm is dummy coding all the rows in a and b, before comparing them. This is evident by the result of ID = 2, I would expect a result of 100, not 133 but because of the ```the``` from ID = 3, it gives an higher scoring. I only want to dummy code row by row before comparison. Make sense? – Mikee Apr 29 '21 at 13:57
  • @user1783739 Edited my answer(I had messed up on a part of the code when defining u) , please check now? – anky Apr 29 '21 at 14:01