1

Consider the following data frames:

import pandas as pd
df1 = pd.DataFrame({'id': list('fghij'), 'A': ['A' + str(i) for i in range(5)]})
    A id
0  A0  f
1  A1  g
2  A2  h
3  A3  i
4  A4  j
df2 = pd.DataFrame({'id': list('fg'), 'B': ['B' + str(i) for i in range(2)]})
    B id
0  B0  f
1  B1  g
df3 = pd.DataFrame({'id': list('ij'), 'B': ['B' + str(i) for i in range(3, 5)]})
    B id
0  B3  i
1  B4  j

I want to merge them to get

    A id    B
0  A0  f   B0
1  A1  g   B1
2  A2  h  NaN
3  A3  i   B3
4  A4  j   B4

Inspired by this answer I tried

final = reduce(lambda l, r: pd.merge(l, r, how='outer', on='id'), [df1, df2, df3])

but unfortunately it yields

    A id  B_x  B_y
0  A0  f   B0  NaN
1  A1  g   B1  NaN
2  A2  h  NaN  NaN
3  A3  i  NaN   B3
4  A4  j  NaN   B4

Additionally, I checked out this question but I can't adapt the solution to my problem. Also, I didn't find any options in the docs for pandas.merge to make this happen.

In my real world problem the list of data frames might be much longer and the size of the data frames might be much larger.

Is there any "pythonic" way to do this directly and without "postprocessing"? It would be perfect to have a solution that raises an exception if column B of df2 and df3 would overlap (so if there might be multiple candidates for some value in column B of the final data frame).

cs95
  • 379,657
  • 97
  • 704
  • 746
d4tm4x
  • 458
  • 4
  • 14

1 Answers1

1

Consider pd.concat + groupby?

pd.concat([df1, df2, df3], axis=0).groupby('id').first().reset_index()

  id   A    B
0  f  A0   B0
1  g  A1   B1
2  h  A2  NaN
3  i  A3   B3
4  j  A4   B4
cs95
  • 379,657
  • 97
  • 704
  • 746
  • This solution also works for `df3 = pd.DataFrame({'id': list('ijk'), 'B': ['B' + str(i) for i in range(3, 6)]})` (a case I forgot in my minimal example). Thanks! – d4tm4x Mar 26 '18 at 09:59