How to merge two pandas data frames?

Question

I have two pandas data frames (see below). I want to merge them based on the id (Dataframe1) and localid (Dataframe2). This code is not working; it creates additional rows in dfmerged because Dataframe2 may contain the same localid multiple times (e.g., D3). How can I merge these two dataframes and set the value of the 'color' column to NaN if the localid does not exist in the first dataframe (DataFrame1)?

dfmerged = pd.merge(df1, df2, left_on='id', right_on='localid')

You'll first need to de-duplicate the ids in df2 by combining the colours into a single list. Secondly, you need to pass `how='outer'` if you want all ids in the final merged df; the default is `inner`, so only ids that are present in both will be merged. – EdChum, Oct 06 '16 at 09:15

jezrael · Accepted Answer · 2016-10-06T09:25:52.947

2

I think you need to group df2 by localid and sum the list values (which concatenates them), then merge and drop the localid column:

import pandas as pd

df1 = pd.DataFrame({'id':['D1','D2','D3','D4','D5','D6'],
                   'Field1':[12,15,11,7,55,8.8]})

print (df1)
   Field1  id
0    12.0  D1
1    15.0  D2
2    11.0  D3
3     7.0  D4
4    55.0  D5
5     8.8  D6

df2 = pd.DataFrame({'localid':['D1','D2','D3','D3','D9'],
                   'color':[['b'],['a'],['a','b'],['s','d'], ['a']]})

print (df2)
    color localid
0     [b]      D1
1     [a]      D2
2  [a, b]      D3
3  [s, d]      D3
4     [a]      D9

df2 = df2.groupby('localid', as_index=False)['color'].sum()
print (df2)
  localid         color
0      D1           [b]
1      D2           [a]
2      D3  [a, b, s, d]
3      D9           [a]


# Wrap the chain in parentheses so .drop() can continue on its own line.
dfmerged = (pd.merge(df1,
                     df2,
                     left_on='id',
                     right_on='localid',
                     how='left')
              .drop('localid', axis=1))

print (dfmerged)
   Field1  id         color
0    12.0  D1           [b]
1    15.0  D2           [a]
2    11.0  D3  [a, b, s, d]
3     7.0  D4           NaN
4    55.0  D5           NaN
5     8.8  D6           NaN
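
As a variation on the steps above, here is a minimal sketch that flattens the lists explicitly instead of relying on `sum` to concatenate them, and asks `merge` to check that the right-hand keys really are unique. The name `colors` is only illustrative, and the use of `itertools.chain` and `validate='many_to_one'` is an assumption about what suits this data, not part of the original answer.

```python
from itertools import chain

import pandas as pd

df1 = pd.DataFrame({'id': ['D1', 'D2', 'D3', 'D4', 'D5', 'D6'],
                    'Field1': [12, 15, 11, 7, 55, 8.8]})
df2 = pd.DataFrame({'localid': ['D1', 'D2', 'D3', 'D3', 'D9'],
                    'color': [['b'], ['a'], ['a', 'b'], ['s', 'd'], ['a']]})

# Flatten all lists that share a localid into one list per key.
colors = (df2.groupby('localid')['color']
             .agg(lambda lists: list(chain.from_iterable(lists)))
             .reset_index())

# validate='many_to_one' raises a MergeError if localid is still duplicated
# on the right-hand side, so extra rows cannot appear silently.
dfmerged = (pd.merge(df1, colors, left_on='id', right_on='localid',
                     how='left', validate='many_to_one')
              .drop(columns='localid'))

print(dfmerged)
```

With `how='left'` every row of df1 is kept, and ids without a match in df2 get NaN in the color column.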

edited Oct 06 '16 at 09:25

answered Oct 06 '16 at 09:19

jezrael


the D3 color values should be [a,b,s,d]. – kitchenprinzessin Oct 06 '16 at 09:25
I am just about to add the groupby statement, you are quick! thanks :) – kitchenprinzessin Oct 06 '16 at 09:28
Thank you for accepting! And a small piece of advice: check [How to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and don't use pictures, because then it is impossible to copy the data. ;) – jezrael Oct 06 '16 at 09:29
just started with python programming, thanks for the examples ^O^ – kitchenprinzessin Oct 06 '16 at 09:35

score 0 · Answer 2 · answered Oct 06 '16 at 09:36

You should probably simplify df2 so that it has no repeating keys, and then tell pd.merge to use the union of keys from both frames (with how='outer'):

import pandas as pd
df1 = pd.DataFrame({    'id':['D1','D2','D3','D4','D5','D6'],
                    'Field1':[  12,  15,  11,   7,  55, 8.8]})
df2 = pd.DataFrame({'localid':['D1','D2','D3','D3','D9'],
                      'color':[['blue','grey'],
                               ['yellow'],
                               ['black','red','green'],
                               ['white'],
                               ['blue']]})
dfmerged = pd.merge(df1, df2, left_on='id', right_on='localid')
dfmerged2 = pd.merge(df1, df2, left_on='id', right_on='localid', how='outer')

Which results in:

>>> dfmerged2
   Field1   id                color localid
0    12.0   D1         [blue, grey]      D1
1    15.0   D2             [yellow]      D2
2    11.0   D3  [black, red, green]      D3
3    11.0   D3              [white]      D3
4     7.0   D4                  NaN     NaN
5    55.0   D5                  NaN     NaN
6     8.8   D6                  NaN     NaN
7     NaN  NaN               [blue]      D9
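
The output above still contains two D3 rows because df2 was merged without being simplified first. A rough sketch of doing both steps, assuming the colours for a repeated localid should simply be combined into one list, might look like this. The names `df2_unique` and `dfmerged3` are hypothetical, and `indicator=True` is added only to show where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['D1', 'D2', 'D3', 'D4', 'D5', 'D6'],
                    'Field1': [12, 15, 11, 7, 55, 8.8]})
df2 = pd.DataFrame({'localid': ['D1', 'D2', 'D3', 'D3', 'D9'],
                    'color': [['blue', 'grey'],
                              ['yellow'],
                              ['black', 'red', 'green'],
                              ['white'],
                              ['blue']]})

# One row per colour, then regroup so that each localid appears only once.
df2_unique = (df2.explode('color')
                 .groupby('localid')['color']
                 .agg(list)
                 .reset_index())

# Outer merge keeps the keys of both frames; the _merge column reports
# 'both', 'left_only' or 'right_only' for every row.
dfmerged3 = pd.merge(df1, df2_unique, left_on='id', right_on='localid',
                     how='outer', indicator=True)

print(dfmerged3)
```

Rows marked `left_only` are ids with no colour in df2, and `right_only` marks D9, which has no counterpart in df1.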
