How can I merge these two datasets on 'Name' and 'Year'?

Question

I am new to this field and stuck on this problem. I have two datasets.

all_batsman_df, which has 5 columns ('years', 'team', 'pos', 'name', 'salary'):

       years    team    pos name            salary
0       1991    SF      1B  Will Clark      3750000.0
1       1991    NYY     1B  Don Mattingly   3420000.0
2       1991    BAL     1B  Glenn Davis     3275000.0
3       1991    MIL     DH  Paul Molitor    3233333.0
4       1991    TOR     3B  Kelly Gruber    3033333.0

all_batting_statistics_df, this df has 31 columns

    Year    Rk  Name    Age Tm  Lg  G   PA  AB  R   ... SLG OPS OPS+    TB  GDP HBP SH  SF  IBB Pos Summary
0   1988    1   Glen Davis  22  SDP NL  37  89  83  6   ... 0.289   0.514   48.0    24  1   1   0   1   1   987
1   1988    2   Jim Acker   29  ATL NL  21  6   5   0   ... 0.400   0.900   158.0   2   0   0   0   0   0   1
2   1988    3   Jim Adduci* 28  MIL AL  44  97  94  8   ... 0.383   0.641   77.0    36  1   0   0   3   0   7D/93
3   1988    4   Juan Agosto*    30  HOU NL  75  6   5   0   ... 0.000   0.000   -100.0  0   0   0   1   0   0   1
4   1988    5   Luis Aguayo 29  TOT MLB 99  260 237 21  ... 0.354   0.663   88.0    84  6   1   1   1   3   564

I want to merge these two datasets on year and name. The problem is that the same player can be spelled differently in the two data frames. For example, the first dataset has 'Glenn Davis' but the second has 'Glen Davis'.

How can I merge them using the difflib library even though the names differ? Any help will be appreciated. Thanks in advance.

I have used the code below, which I found in another question on this site, but it is not working for me. It adds a new column after matching the names in both datasets. I know this is not a good approach, so please suggest a better way if there is one.

df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)

df_a['merge_year'] = df_a['Year'] # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']

for comp_a, addr_a in df_a[['Year','Name']].values:
    for ixb, (comp_b, addr_b) in enumerate(df_b[['years','name']].values):
        if cdifflib.CSequenceMatcher(None,comp_a,comp_b).ratio() > .6:
            df_b.loc[ixb,'merge_year'] = comp_a # creates a merge key in df_b
        if cdifflib.CSequenceMatcher(None,addr_a, addr_b).ratio() > .6:
            df_b.loc[ixb,'merge_name'] = addr_a # creates a merge key in df_b


merged_df = pd.merge(df_a,df_b,on=['merge_name','merge_years'],how='inner')

If you know what the names are supposed to be, then first, clean your dataset — Trenton McKinney, May 02 '20 at 19:21
Can you be more specific about what the issue is? Please see [ask], [help/on-topic], and provide a [mcve] as well as the current and expected output. — AMC, May 03 '20 at 02:52

score 0 · Answer 1 · answered May 02 '20 at 19:36

You can do

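import difflib

# df_a is the statistics frame, whose name column is 'Name' (capital N)
candidates = df_a['Name'].tolist()
# keep the original name when no close match is found, instead of raising IndexError
df_b['name'] = df_b['name'].apply(
    lambda x: (difflib.get_close_matches(x, candidates) or [x])[0])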

to replace names in df_b with closest match from df_a, then do your merge. See also this post.
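
One caveat with matching on the name alone: the same name can appear in several years, and get_close_matches returns an empty list when nothing clears its cutoff. Here is a minimal sketch of a stricter variant that only compares names within the same year. It assumes toy data shaped like the frames in the question. The values, the helper closest_name, the clean_name and merge_name columns, and the 0.8 cutoff are all illustrative assumptions, not something from the original post.

```python
import difflib
import pandas as pd

# Toy stand-ins for the two frames in the question (values are illustrative).
salaries = pd.DataFrame({
    "years": [1991, 1991],
    "name": ["Glenn Davis", "Will Clark"],
    "salary": [3275000.0, 3750000.0],
})
stats = pd.DataFrame({
    "Year": [1991, 1991, 1988],
    "Name": ["Glen Davis", "Will Clark*", "Glen Davis"],
    "G": [49, 148, 37],
})

# Some names in the statistics frame carry a trailing '*'; strip it before comparing.
stats["clean_name"] = stats["Name"].str.rstrip("*")


def closest_name(row, cutoff=0.8):
    """Return the closest statistics name from the same year, or None."""
    candidates = stats.loc[stats["Year"] == row["years"], "clean_name"].tolist()
    matches = difflib.get_close_matches(row["name"], candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Build the merge key on the salary side, then merge on year and matched name.
salaries["merge_name"] = salaries.apply(closest_name, axis=1)
merged = salaries.merge(
    stats,
    left_on=["years", "merge_name"],
    right_on=["Year", "clean_name"],
    how="left",
)
```

Rows without a sufficiently close candidate get None as merge_name, so they come through the left merge with empty statistics columns and are easy to check by hand.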

stevemo


score 0 · Answer 2 · answered May 02 '20 at 19:52

Let me approach your problem by assuming that you need a dataset built around two columns: 'year' and 'name'.

1. First, fix all the names that are spelled wrong. I hope you know which names in all_batting_statistics_df are wrong; correct each one like this:

all_batting_statistics_df['Name'] = all_batting_statistics_df['Name'].replace(regex=r'^Glen Davis$', value='Glenn Davis')

When correcting the spellings, work from the smaller dataset, the one whose names you know, so that it doesn't take long.

2. Both datasets need to keep their year and name columns, since those are the merge keys. Use this to drop the columns you don't need:

all_batsman_df_1 = all_batsman_df.drop(['team','pos','salary'], axis=1)

all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk','Age','Tm','Lg','G','PA','AB','R','Pos Summary'], axis=1)

I cannot see all 31 columns, so I left the rest out; you will have to add them to the code above.

3. Rename the key columns of the statistics dataset to lowercase 'year' and 'name' using the pandas DataFrame rename method:

df_new_1 = all_batting_statistics_df_1.rename(columns={'Year': 'year', 'Name': 'name'})

4. Finally, merge them on both key columns:

all_batsman_df.merge(df_new_1, left_on=['years', 'name'], right_on=['year', 'name'])
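
If you don't already know which names are misspelled, an outer merge with indicator=True can list the keys that appear on one side only. Below is a rough sketch of the whole workflow on toy data. The values, the short variable names salaries and stats, and the hand-written corrections dictionary are assumptions made for illustration; only the column names come from the question.

```python
import pandas as pd

# Toy data with the column names from the question; values are illustrative.
all_batsman_df = pd.DataFrame({
    "years": [1991, 1991],
    "name": ["Glenn Davis", "Will Clark"],
    "salary": [3275000.0, 3750000.0],
})
all_batting_statistics_df = pd.DataFrame({
    "Year": [1991, 1991],
    "Name": ["Glen Davis", "Will Clark"],
    "G": [49, 148],
})

# Align the key column names so both frames can be merged on ['year', 'name'].
salaries = all_batsman_df.rename(columns={"years": "year"})
stats = all_batting_statistics_df.rename(columns={"Year": "year", "Name": "name"})

# The _merge column says whether a key was found in both frames or only one.
check = salaries.merge(stats, on=["year", "name"], how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["year", "name", "_merge"]]
print(unmatched)

# Hypothetical correction table, filled in by hand after reading `unmatched`.
corrections = {"Glen Davis": "Glenn Davis"}
stats["name"] = stats["name"].replace(corrections)

# With the spellings aligned, an ordinary inner merge on both keys works.
merged = salaries.merge(stats, on=["year", "name"], how="inner")
```

Here unmatched shows 'Glenn Davis' as left_only and 'Glen Davis' as right_only, which is the kind of pair that belongs in the corrections dictionary.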

FINAL THOUGHTS: If you don't want to do all this, you can export the dataset to Google Sheets or Microsoft Excel and edit it there. If you like pandas, it is not that difficult and you will find a way. All the best!
