
I have a list of pandas dataframes, shown below, and would like to concat/merge them so that the shared key column contains the union of its values across all of the dataframes, with the other columns filled in where available. What is the best approach?

DF 1:

Col1   Col2
BLAH1  A
BLAH2  Z

DF 2:

Col1   Col2  Col3
BLAH2  Z     B
BLAH3  Q     K

DF 3:

Col1   Col4
BLAH2  C
BLAH3  W

Desired Outcome:

Col1   Col2  Col3  Col4
BLAH1  A     NaN   NaN
BLAH2  Z     B     C
BLAH3  Q     K     W
magic_frank
  • related: [pandas three-way joining multiple dataframes on columns](https://stackoverflow.com/questions/23668427/pandas-three-way-joining-multiple-dataframes-on-columns) – anky Dec 13 '21 at 18:22

3 Answers


If the keys are unique within each dataframe, you can do a concat followed by a groupby (`first()` takes the first non-null value in each column per group, so the frames fill in each other's gaps):

list_dfs = [df1, df2, df3]
pd.concat(list_dfs).groupby('Col1').first()
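
A minimal, runnable sketch of that approach, assuming the sample frames from the question (the `reset_index()` at the end is an addition to bring `Col1` back as a regular column rather than the index):

import pandas as pd

list_dfs = [df1, df2, df3]
result = (
    pd.concat(list_dfs)    # stack all rows; missing columns become NaN
      .groupby('Col1')     # group rows sharing a key
      .first()             # first non-null value per column in each group
      .reset_index()       # restore Col1 as a regular column
)
print(result)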

In the general case, you can combine reduce and merge:

from functools import reduce

# find the columns common to all dataframes
commons = list(reduce(lambda x, y: set(x) & set(y), list_dfs))

# merge on those shared columns; how='outer' keeps keys that appear
# in only some of the dataframes (e.g. BLAH1)
reduce(lambda x, y: x.merge(y, on=commons, how='outer'), list_dfs)
Quang Hoang
  • Would both cases you provide handle duplicate columns between the dataframes? Apologies for not mentioning at first that this is also a possibility - I have updated the question. – magic_frank Dec 13 '21 at 18:22
  • Either way, you would need to identify the common columns first, then use that list in place of `'Col1'`. See the updated answer (second approach). – Quang Hoang Dec 13 '21 at 18:22
  • If `Col1` is the common one I want to merge on, but `Col2` appears in multiple dataframes in the list, how is that best handled? Currently I use suffixes=('', '_y') and then df.drop(df.filter(regex='_y$').columns.tolist(), axis=1, inplace=True) (a sketch of this follows the comments). – magic_frank Dec 13 '21 at 18:27
  • That's gonna be difficult; you should try a sample set of data where `Col2` is only in `df2` and `df3`, and ask yourself what you want from that. – Quang Hoang Dec 13 '21 at 18:29
  • I want it exactly as shown in the edited question. `Col2` should take the values from the dataframes (where a column is duplicated, its values will be the same across the dataframes). – magic_frank Dec 13 '21 at 18:31
  • As long as the data are unique by `Col1`, the first approach would be fine. – Quang Hoang Dec 13 '21 at 18:33
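
A minimal sketch of that suffix-and-drop trick, assuming the question's sample frames; note the added coalescing step (not in the original comment), without which a value present only in a right-hand frame, such as `Col2` for `BLAH3`, would be lost:

from functools import reduce
import pandas as pd

merged = reduce(
    lambda left, right: pd.merge(left, right, on='Col1', how='outer',
                                 suffixes=('', '_y')),
    [df1, df2, df3],
)

# coalesce each duplicated '_y' column into its base column before dropping,
# otherwise values that appear only in a later frame disappear
for dup in merged.filter(regex='_y$').columns:
    merged[dup[:-2]] = merged[dup[:-2]].fillna(merged[dup])
merged = merged.drop(columns=merged.filter(regex='_y$').columns)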

We can use reduce and merge like so:

import pandas as pd
from functools import reduce

reduce(
    lambda left, right: pd.merge(
        left, right,
        on=list(left.columns.intersection(right.columns)),
        how='outer',
    ),
    [df1, df2, df3],
)

Here, reduce applies a function of two arguments cumulatively to the items of the iterable, from left to right, so as to reduce the iterable to a single value.
The left argument, left, is the accumulated value and the right argument, right, is the next value from the iterable.
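
As a toy illustration of that left fold (not part of the answer's merge itself):

from functools import reduce

# the three-frame merge unrolls to pd.merge(pd.merge(df1, df2, ...), df3, ...),
# exactly like this numeric fold: ((1 + 2) + 3) + 4
reduce(lambda left, right: left + right, [1, 2, 3, 4])  # 10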

The trick here is to merge each pair on the list of columns they have in common, which is what yields the expected result. @QuangHoang did something similar and found the trick before me.

Output :

    Col1    Col2    Col3    Col4
0   BLAH1   A       NaN     NaN
1   BLAH2   Z       B       C
2   BLAH3   Q       K       W
tlentali

Let's use functools.reduce with pd.DataFrame.combine_first and a generator expression:

from functools import reduce
reduce(lambda x, y: x.combine_first(y), 
       (df.set_index('Col1') for df in [df1,df2,df3])).reset_index()

Output:

    Col1 Col2 Col3 Col4
0  BLAH1    A  NaN  NaN
1  BLAH2    Z    B    C
2  BLAH3    Q    K    W
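
For reference, `x.combine_first(y)` aligns the two frames on their index and columns and patches the null (or missing) entries of `x` with values from `y`, which is why setting `Col1` as the index first matters. A toy example, separate from the answer's data:

import pandas as pd

a = pd.DataFrame({'v': [1, None]}, index=['k1', 'k2'])
b = pd.DataFrame({'v': [9, 2], 'w': [3, 4]}, index=['k2', 'k3'])
a.combine_first(b)
#       v    w
# k1  1.0  NaN
# k2  9.0  3.0
# k3  2.0  4.0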

Given the input dataframes:

df1 = pd.DataFrame({'Col1':['BLAH1', 'BLAH2'],
                   'Col2':[*'AZ']})
df2 = pd.DataFrame({'Col1':['BLAH2', 'BLAH3'],
                   'Col2':[*'ZQ'],
                   'Col3':[*'BK']})
df3 = pd.DataFrame({'Col1':['BLAH2', 'BLAH3'],
                    'Col4':[*'CW']})
Scott Boston