
I have a list of pandas dataframes, shown below, and would like to concat/merge them so that the shared key column contains the union of its values across all of the dataframes, with the other columns filled in where available. What is the best approach?

DF 1:

Col1   Col2
BLAH1  A
BLAH2  Z

DF 2:

Col1   Col2  Col3
BLAH2  Z     B
BLAH3  Q     K

DF 3:

Col1   Col4
BLAH2  C
BLAH3  W

Desired Outcome:

Col1   Col2  Col3  Col4
BLAH1  A     NaN   NaN
BLAH2  Z     B     C
BLAH3  Q     K     W
magic_frank
  • related: [pandas three-way joining multiple dataframes on columns](https://stackoverflow.com/questions/23668427/pandas-three-way-joining-multiple-dataframes-on-columns) – anky Dec 13 '21 at 18:22

3 Answers


If the keys are unique within each dataframe, you can do a concat followed by a groupby (`first()` takes the first non-null value in each column per group, so the frames fill in each other's gaps):

list_dfs = [df1, df2, df3]
pd.concat(list_dfs).groupby('Col1').first()
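
A minimal, runnable sketch of that approach, assuming the sample frames from the question (the `reset_index()` at the end is an addition to bring `Col1` back as a regular column rather than the index):

import pandas as pd

list_dfs = [df1, df2, df3]
result = (
    pd.concat(list_dfs)    # stack all rows; missing columns become NaN
      .groupby('Col1')     # group rows sharing a key
      .first()             # first non-null value per column in each group
      .reset_index()       # restore Col1 as a regular column
)
print(result)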

In the general case, you can combine reduce and merge:

from functools import reduce

# find the columns common to all dataframes
commons = list(reduce(lambda x, y: set(x) & set(y), list_dfs))

# merge on those shared columns; how='outer' keeps keys that appear
# in only some of the dataframes (e.g. BLAH1)
reduce(lambda x, y: x.merge(y, on=commons, how='outer'), list_dfs)
Quang Hoang
  • Would both cases you provide handle duplicate columns between the dataframes? Apologies for not mentioning at first that this is also a possibility - I have updated the question. – magic_frank Dec 13 '21 at 18:22
  • Either way, you would need to identify the common columns first, then use that list in place of `'Col1'`. See the updated answer (second approach). – Quang Hoang Dec 13 '21 at 18:22
  • If `Col1` is the common one I want to merge on, but `Col2` appears in multiple dataframes in the list, how is that best handled? Currently I use suffixes=('', '_y') and then df.drop(df.filter(regex='_y$').columns.tolist(), axis=1, inplace=True) (a sketch of this follows the comments). – magic_frank Dec 13 '21 at 18:27
  • That's gonna be difficult; you should try a sample set of data where `Col2` is only in `df2` and `df3`, and ask yourself what you want from that. – Quang Hoang Dec 13 '21 at 18:29
  • I want it exactly as shown in the edited question. `Col2` should take the values from the dataframes (where a column is duplicated, its values will be the same across the dataframes). – magic_frank Dec 13 '21 at 18:31
  • As long as the data are unique by `Col1`, the first approach would be fine. – Quang Hoang Dec 13 '21 at 18:33
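
A minimal sketch of that suffix-and-drop trick, assuming the question's sample frames; note the added coalescing step (not in the original comment), without which a value present only in a right-hand frame, such as `Col2` for `BLAH3`, would be lost:

from functools import reduce
import pandas as pd

merged = reduce(
    lambda left, right: pd.merge(left, right, on='Col1', how='outer',
                                 suffixes=('', '_y')),
    [df1, df2, df3],
)

# coalesce each duplicated '_y' column into its base column before dropping,
# otherwise values that appear only in a later frame disappear
for dup in merged.filter(regex='_y$').columns:
    merged[dup[:-2]] = merged[dup[:-2]].fillna(merged[dup])
merged = merged.drop(columns=merged.filter(regex='_y$').columns)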

We can use reduce and merge like so:

import pandas as pd
from functools import reduce

reduce(
    lambda left, right: pd.merge(
        left, right,
        on=list(left.columns.intersection(right.columns)),
        how='outer',
    ),
    [df1, df2, df3],
)

Here, reduce applies a function of two arguments cumulatively to the items of the iterable, from left to right, so as to reduce the iterable to a single value.
The left argument, left, is the accumulated value and the right argument, right, is the next value from the iterable.
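
As a toy illustration of that left fold (not part of the answer's merge itself):

from functools import reduce

# the three-frame merge unrolls to pd.merge(pd.merge(df1, df2, ...), df3, ...),
# exactly like this numeric fold: ((1 + 2) + 3) + 4
reduce(lambda left, right: left + right, [1, 2, 3, 4])  # 10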

The trick here is to merge each pair on the list of columns they have in common, which is what yields the expected result. @QuangHoang did something similar and found the trick before me.

Output :

    Col1    Col2    Col3    Col4
0   BLAH1   A       NaN     NaN
1   BLAH2   Z       B       C
2   BLAH3   Q       K       W
tlentali

Let's use functools.reduce with pd.DataFrame.combine_first and a generator expression:

from functools import reduce
reduce(lambda x, y: x.combine_first(y), 
       (df.set_index('Col1') for df in [df1,df2,df3])).reset_index()

Output:

    Col1 Col2 Col3 Col4
0  BLAH1    A  NaN  NaN
1  BLAH2    Z    B    C
2  BLAH3    Q    K    W
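
For reference, `x.combine_first(y)` aligns the two frames on their index and columns and patches the null (or missing) entries of `x` with values from `y`, which is why setting `Col1` as the index first matters. A toy example, separate from the answer's data:

import pandas as pd

a = pd.DataFrame({'v': [1, None]}, index=['k1', 'k2'])
b = pd.DataFrame({'v': [9, 2], 'w': [3, 4]}, index=['k2', 'k3'])
a.combine_first(b)
#       v    w
# k1  1.0  NaN
# k2  9.0  3.0
# k3  2.0  4.0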

Given the input dataframes:

df1 = pd.DataFrame({'Col1':['BLAH1', 'BLAH2'],
                   'Col2':[*'AZ']})
df2 = pd.DataFrame({'Col1':['BLAH2', 'BLAH3'],
                   'Col2':[*'ZQ'],
                   'Col3':[*'BK']})
df3 = pd.DataFrame({'Col1':['BLAH2', 'BLAH3'],
                    'Col4':[*'CW']})
Scott Boston