1

I am trying to calculate fuzz ratios for multiple rows in 2 data frames:

df1:

id    name
1     Ab Cd E
2     X.Y!Z
3     fgh I

df2:

name_2
abcde
xyz

I want to calculate the fuzz ratio between all the values in df1.name and df2.name_2:

To do that I have code:

for i in df1['name']:
    for r in df2['name_2']:
        print(fuzz.ratio(i,r))

But I want the final result to have the ids from df1 as well. It would ideally look like this:

final_df:

id      name        name_2    score
1      Ab Cd E      abcde      50
1      Ab Cd E       xyz        0
2       X.Y!Z       abcde       0
2       X.Y!Z        xyz       60
3       fgh I       abcde       0
3       fgh I        xyz        0

Thanks for the help!

Sravee
  • 113
  • 5
  • 14
  • You can try something like this: `for i, r in zip(df1['name'], df2['name_2']): ...` This will only work for what you want if both dataframes are the same length, but should work for your example. – L. MacKenzie Oct 13 '17 at 20:33
  • Thank you! But my second dataframe has only 2 rows as compared to the first one which has 3. – Sravee Oct 13 '17 at 20:35
  • @L.MacKenzie additionally I also want the ids from the first dataframe. Thank you for the help! – Sravee Oct 13 '17 at 20:43

2 Answers2

1

You can solve your problem like this:

Create an empty DataFrame:

final = pandas.DataFrame({'id': [], 'name': [], 'name_2': [], 'score': []})

Iterate through the two DataFrames inserting the id, names, and score and concatenating it onto the final DataFrame:

for id, name in zip(df1['id'], df1['name']):
    for name2 in df2['name_2']:
        tmp = pandas.DateFrame({'id': id, 'name': name, 'name_2': name2, 'score': fuzz.ratio(name, name2)})
    final = pandas.concat([final, tmp], ignore_index=True)

print(final)

There is probably a cleaner and more efficient way to do this, but I hope this helps.

L. MacKenzie
  • 493
  • 4
  • 14
0

I don't fully understand the application of lambda functions in pd.apply, but after some SO searching, I think this is a reasonable solution.

import pandas as pd
from fuzzywuzzy import fuzz

d = [{'id': 1, 'name': 'Ab Cd e'}, {'id': 2, 'name': 'X.Y!Z'}, {'id': 3, 'name': 'fgh I'}] 
df1 = pd.DataFrame(d)
df2 = pd.DataFrame({'name_2': ['abcde', 'xyz']})

This is a cross join in pandas; a tmp df is required pandas cross join no columns in common

df1['tmp'] = 1
df2['tmp'] = 1

df = pd.merge(df1, df2, on=['tmp'])
df = df.drop('tmp', axis=1)

You can .apply the function fuzz.ratio to columns in the df. Pandas: How to use apply function to multiple columns

df['fuzz_ratio'] = df.apply(lambda row: fuzz.ratio(row['name'], row['name_2']), axis = 1)

df

I also tried setting an index on df1, but that resulted in its exclusion from the cross-joined df.

Evan
  • 2,121
  • 14
  • 27