1

I have two dataframes with different rows numbers contain information about players. The first has all names that I need.

df1 = pd.DataFrame({'Player': ["John Sepi", 'Zan Fred', 'Mark Daniel', 'Adam Pop', 'Paul Sepi', 'John Hernandez', 'Price Josiah', 'John Hernandez', 'Adam Pop'], 
                    'Team': ['A', 'C', 'E', 'C', 'B', 'D', 'B', 'A', 'D']})

The another dataframe is missing some players, but has a column with age. The player's names have smaller differences in some cases.

df2 = pd.DataFrame({'Player': ["John Sepi", 'Mark A. Daniel', 'John Hernandez', 'Price Josiah', 'John Hernandez', 'Adam Pop'], 
                    'Team': ['A', 'E', 'D', 'B', 'A', 'D'],
                   'Age': [22, 21, 26, 18, 19, 25]})

The equals names are different persons, because of that i need match at the same time Player and Team. I want to create a new dataframe with all names from first dataframe with respective age from second dataframe. In case of missing players in second, complete new dataframe with constant value(like XX years, can be any age..just to illustrate). The final dataframe:

print(final_df)
           Player Team  Age
0       John Sepi    A   22
1        Zan Fred    C   XX
2     Mark Daniel    E   21
3        Adam Pop    C   XX
4       Paul Sepi    B   XX
5  John Hernandez    D   26
6    Price Josiah    B   18
7  John Hernandez    A   19
8        Adam Pop    D   25

Neto
  • 15
  • 4
  • Does this answer your question? [is it possible to do fuzzy match merge with python pandas?](https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas) – Chris May 18 '22 at 19:44
  • i tried the solution in this post, but didnt work.. – Neto May 18 '22 at 22:20

2 Answers2

2

You can use the text matching capabilities of the fuzzywuzzy library mixed with pandas functions in python.

First, import the following libraries :

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

You can use the text matching capabilities of the fuzzywuzzy python library :

#get list of unique teams existing in df1
lst_teams = list(np.unique(np.array(df1['Team'])))
#define arbitrary threshold
thres = 70
#for each team match similar texts
for team in lst_teams:
    #iterration on dataframe filtered by team
    for index, row in df1.loc[df1['Team']==team].iterrows():
        #get list of players in this team
        lst_player_per_team = list(np.array(df2.loc[df2['Team']==team]['Player']))
        #use of fuzzywuzzy to make text matching
        output_ratio = process.extract(row['Player'], lst_player_per_team, scorer=fuzz.token_sort_ratio)
        #check if there is players from df2 in this team
        if output_ratio !=[]:
            #put arbitrary threshold to get most similar text
            if output_ratio[0][1]>thres:
                df1.loc[index, 'Age'] = df2.loc[(df2['Team']==team)&(df2['Player']==output_ratio[0][0])]['Age'].values[0]
df1 = df1.fillna('XX')

with this code and a threshold defined as 70, you get the following result:

print(df1)
           Player Team Age
0       John Sepi    A  22
1        Zan Fred    C  XX
2     Mark Daniel    E  21
3        Adam Pop    C  XX
4       Paul Sepi    B  XX
5  John Hernandez    D  26
6    Price Josiah    B  18
7  John Hernandez    A  19
8        Adam Pop    D  25

It is possible to move the threshold to increase the accuracy of the text matching capabilities between the two dataframes.

Please note that you should be careful when using .iterrows() as iteration on a dataframe is not advised.

You can check the fuzzywuzzy doc here https://pypi.org/project/fuzzywuzzy/

  • getting a list of unique player from df2 will remove important data and its necessary link team in fuzzy matching. too – Neto May 24 '22 at 06:05
  • can you be more specific about the issue it raises? you can try taking a list without the np.unique – Marcel_SweetGazelle276 May 25 '22 at 14:07
  • Thanks for reply. I'm very new in this, sorry if i don't understand anything. Its missing players in final result and i want to keep players name and team from my df1, i edited the post and tried to elucidate the situation, and printe the desire result. Thanks again – Neto May 25 '22 at 15:37
  • just changed the answer, it works for you problem. Please specify your issue if there is one. – Marcel_SweetGazelle276 May 31 '22 at 11:04
  • No its perfect now, thanks – Neto May 31 '22 at 23:58
0

here is one way:

df1 = df1.merge(df2,how='left', on=['Players','Team']).fillna(20)
eshirvana
  • 23,227
  • 3
  • 22
  • 38