Say I have a pandas dataframe that looks like this:
ID  String1                    String2
1   The big black wolf         The small wolf
2   Close the door on way out  door the Close
3   where's the money          where is the money
4   123 further out            out further
For each row, I want to cross-tabulate the words in String1 against the words in String2 and then run fuzzy string matching on each word pair, similar to the question "Python fuzzy string matching as correlation style table/matrix".
My challenge is that the solution in the linked question only works when String1 and String2 contain the same number of words. Secondly, that solution compares across all the rows in the column, whereas I only want a row-by-row comparison.
The solution should build a matrix-like comparison for each row. For row 1 it would look like this:
string1  The  big  black  wolf  Maximum
string2
The      100    0      0     0      100
small      0    0      0     0        0
wolf       0    0      0   100      100
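To make the matrix above concrete, here is a minimal sketch of the per-row comparison. It assumes whitespace tokenisation and uses `difflib.SequenceMatcher` from the standard library as a stand-in for a fuzzy scorer; the helper names `word_score` and `score_matrix` are made up for illustration. A real fuzzy scorer will usually give small non-zero scores to partially similar words, so the zeros in the table above are idealised.

```python
from difflib import SequenceMatcher


def word_score(a, b):
    """Similarity of two words on a 0-100 scale (difflib stand-in, an assumption)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def score_matrix(string1, string2):
    """Rows follow the words of string2, columns follow the words of string1."""
    cols = string1.split()
    rows = string2.split()
    return [[word_score(r, c) for c in cols] for r in rows]


# Row 1 of the example data; the two strings may have different word counts.
matrix = score_matrix("The big black wolf", "The small wolf")

# The 'Maximum' column: best score for each word of String2.
row_maxima = [max(row) for row in matrix]
```

Because the matrix is built from two independent word lists, nothing requires String1 and String2 to have the same number of words.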
The result should then be added to the dataframe as a new column:
ID  String1                    String2             Matching_Average
1   The big black wolf         The small wolf      66.67
2   Close the door on way out  door the Close
3   where's the money          where is the money
4   123 further out            out further
where Matching_Average is the sum of the 'Maximum' column divided by the number of words in String2. For row 1 that is (100 + 0 + 100) / 3 = 66.67.
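The averaging rule just described could be sketched roughly as follows. This is only an illustration under assumptions: `matching_average` is a hypothetical helper name, words are split on whitespace, matching is case-sensitive, and `difflib.SequenceMatcher` stands in for whichever fuzzy scorer is actually used, so the numbers will not match the idealised 66.67 exactly.

```python
from difflib import SequenceMatcher


def matching_average(string1, string2):
    """Mean, over the words of string2, of each word's best score against string1."""
    words1 = string1.split()
    words2 = string2.split()
    if not words1 or not words2:
        return 0.0  # assumption: empty input counts as no match
    maxima = [
        max(100 * SequenceMatcher(None, w2, w1).ratio() for w1 in words1)
        for w2 in words2
    ]
    return round(sum(maxima) / len(words2), 2)


# The four example rows as (String1, String2) pairs.
rows = [
    ("The big black wolf", "The small wolf"),
    ("Close the door on way out", "door the Close"),
    ("where's the money", "where is the money"),
    ("123 further out", "out further"),
]

# Each pair is scored on its own, so no row is compared against another row.
averages = [matching_average(s1, s2) for s1, s2 in rows]
```

With a dataframe `df` holding these columns, the same function could be applied row by row with something like `df["Matching_Average"] = df.apply(lambda r: matching_average(r["String1"], r["String2"]), axis=1)`.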