1

I have two pandas data frames (see below).I want to merge them based on the id (Dataframe1) and localid(Dataframe2). This code is not working; it creates additional rows in dfmerged as Dataframe2 may contains multiple same localid(e.g., D3). How can I merge these two dataframes and set the value of the 'color' column as NaN if the localid does not exists in the first dataframe (DataFrame1)?

dfmerged = pd.merge(df1, df2, left_on='id', right_on='localid')

enter image description here

kitchenprinzessin
  • 1,023
  • 3
  • 14
  • 30
  • Can you add desired output? – jezrael Oct 06 '16 at 09:15
  • 1
    You'll first need to de-duplicate the ids in df2 by combining the colours into a single list, secondly you need to pass `how='outer'` if you want all ids in the final merged df by default it's `inner` so only ids that are present in both will be merged – EdChum Oct 06 '16 at 09:15

2 Answers2

2

I think you need groupby and sum values in list in df2 and then use merge with drop column localid:

df1 = pd.DataFrame({'id':['D1','D2','D3','D4','D5','D6'],
                   'Field1':[12,15,11,7,55,8.8]})

print (df1)
   Field1  id
0    12.0  D1
1    15.0  D2
2    11.0  D3
3     7.0  D4
4    55.0  D5
5     8.8  D6

df2 = pd.DataFrame({'localid':['D1','D2','D3','D3','D9'],
                   'color':[['b'],['a'],['a','b'],['s','d'], ['a']]})

print (df2)
    color localid
0     [b]      D1
1     [a]      D2
2  [a, b]      D3
3  [s, d]      D3
4     [a]      D9
df2 = df2.groupby('localid', as_index=False)['color'].sum()
print (df2)
  localid         color
0      D1           [b]
1      D2           [a]
2      D3  [a, b, s, d]
3      D9           [a]


dfmerged = pd.merge(df1, 
                    df2, 
                    left_on='id', 
                    right_on='localid', 
                    how='left')
             .drop('localid', axis=1)

print (dfmerged)
   Field1  id         color
0    12.0  D1           [b]
1    15.0  D2           [a]
2    11.0  D3  [a, b, s, d]
3     7.0  D4           NaN
4    55.0  D5           NaN
5     8.8  D6           NaN
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

You should probably simplify df2 to have no repeating keys, and then tell pd.merge to use union of keys from both frames (with how:'outer'):

import pandas as pd
df1 = pd.DataFrame({    'id':['D1','D2','D3','D4','D5','D6'],
                    'Field1':[  12,  15,  11,   7,  55, 8.8]})
df2 = pd.DataFrame({'localid':['D1','D2','D3','D3','D9'],
                      'color':[['blue','grey'],
                               ['yellow'],
                               ['black','red','green'],
                               ['white'],
                               ['blue']]})
dfmerged = pd.merge(df1, df2, left_on='id', right_on='localid')
dfmerged2 = pd.merge(df1, df2, left_on='id', right_on='localid', how='outer')

Which results in:

>>> dfmerged2
   Field1   id                color localid
0    12.0   D1         [blue, grey]      D1
1    15.0   D2             [yellow]      D2
2    11.0   D3  [black, red, green]      D3
3    11.0   D3              [white]      D3
4     7.0   D4                  NaN     NaN
5    55.0   D5                  NaN     NaN
6     8.8   D6                  NaN     NaN
7     NaN  NaN               [blue]      D9
berna1111
  • 1,811
  • 1
  • 18
  • 23