Avoid duplicate columns while merging with pandas

Question

I have dozens of dataframes I would like to merge with a "reference" dataframe. I want to merge the columns when they exist in both dataframes, or conversely, create a new one when they don't already exist. I have the feeling that this is closely related to this topic but I cannot figure out out to make it work in my case. Also, note that the key used for merging never contains duplicates.

# Reference dataframe
df = pd.DataFrame({'date_time':['2018-06-01 00:00:00','2018-06-01 00:30:00','2018-06-01 01:00:00','2018-06-01 01:30:00']})

# Dataframes to merge to reference dataframe
df1 = pd.DataFrame({'date_time':['2018-06-01 00:30:00','2018-06-01 01:00:00'],
                'potato':[13,21]})

df2 = pd.DataFrame({'date_time':['2018-06-01 01:30:00','2018-06-01 02:00:00','2018-06-01 02:30:00'],
                'carrot':[14,8,32]})

df3 = pd.DataFrame({'date_time':['2018-06-01 01:30:00','2018-06-01 02:00:00'],
                'potato':[27,31]})


df = df.merge(df1, how='left', on='date_time')
df = df.merge(df2, how='left', on='date_time')
df = df.merge(df3, how='left', on='date_time')

The result is :

              date_time  potato_x  carrot  potato_y
0  2018-06-01 00:00:00       NaN     NaN       NaN
1  2018-06-01 00:30:00      13.0     NaN       NaN
2  2018-06-01 01:00:00      21.0     NaN       NaN
3  2018-06-01 01:30:00       NaN    14.0      27.0

While I would like :

              date_time  potato  carrot 
0  2018-06-01 00:00:00       NaN     NaN  
1  2018-06-01 00:30:00      13.0     NaN   
2  2018-06-01 01:00:00      21.0     NaN 
3  2018-06-01 01:30:00      27.0    14.0

Edit (following @sammywemmy's answer): I have no idea what will be the dataframe columns name before importing them (in a loop). Usually, the dataframes that are merged with my reference dataframe contain about 100 columns, from which 90%-95% are common with the other dataframes.

Every new dataframe to be merge contains about 100 columns. Among these 100 columns, there might be 10 columns that have a name that is not present in the previous dataframes. So, assuming that I want to merge 15 dataframes, I will have at the end 100 columns + 15*10 = 250 columns — Antoine T, Feb 26 '20 at 02:33
it seems the other columns are food names (potato, carrot,...) and the common key is date_time. 100 columns is a lot and i dont see how you can keep track of that. I suggest you write code that melts every dataframe, using date_time as your index_var, then perform the merge. — sammywemmy, Feb 26 '20 at 03:20

Scott Boston · Accepted Answer · 2020-02-26T14:07:48.837

I would pd.concat similar structured dataframes then merge the others like this:

df.merge(pd.concat([df1, df3]), on='date_time', how='left')\
  .merge(df2, on='date_time', how='left')

Output:

             date_time  potato  carrot
0  2018-06-01 00:00:00     NaN     NaN
1  2018-06-01 00:30:00    13.0     NaN
2  2018-06-01 01:00:00    21.0     NaN
3  2018-06-01 01:30:00    27.0    14.0

Per comments below:

df = pd.DataFrame({'date_time':['2018-06-01 00:00:00','2018-06-01 00:30:00','2018-06-01 01:00:00','2018-06-01 01:30:00']})

# Dataframes to merge to reference dataframe
df1 = pd.DataFrame({'date_time':['2018-06-01 00:30:00','2018-06-01 01:00:00'],
                'potato':[13,21]})

df2 = pd.DataFrame({'date_time':['2018-06-01 01:30:00','2018-06-01 02:00:00','2018-06-01 02:30:00'],
                'carrot':[14,8,32]})

df3 = pd.DataFrame({'date_time':['2018-06-01 01:30:00', '2018-06-01 02:00:00'],'potato':[27,31], 'zucchini':[11,1]})

df.merge(pd.concat([df1, df3]), on='date_time', how='left').merge(df2, on='date_time', how='left')

Output:

             date_time  potato  zucchini  carrot
0  2018-06-01 00:00:00     NaN       NaN     NaN
1  2018-06-01 00:30:00    13.0       NaN     NaN
2  2018-06-01 01:00:00    21.0       NaN     NaN
3  2018-06-01 01:30:00    27.0      11.0    14.0

I think your solution works only if the merged/concatenated dataframe is either fully similar or diffirent from df. For example, I wouldn't know how to deal with a dataframe such as : `df3 = pd.DataFrame({'date_time':['2018-06-01 01:30:00', '2018-06-01 02:00:00'],'potato':[27,31], 'zucchini':[11,1]})` — Antoine T, Feb 26 '20 at 02:29

score 0 · Answer 2 · answered Feb 26 '20 at 00:36

0

Continuing from your code,
use the filter method to pull out potato related columns,
sum them along the columns axis,
and remove columns that contain potato_...

df['potato'] = df.filter(like='potato').fillna(0).sum(axis=1)

exclude_columns = df.columns.str.contains('potato_[a-z]')
df = df.loc[:,~exclude_columns]

    date_time         carrot    potato
0   2018-06-01 00:00:00 NaN     0.0
1   2018-06-01 00:30:00 NaN     13.0
2   2018-06-01 01:00:00 NaN     21.0
3   2018-06-01 01:30:00 14.0    27.0

answered Feb 26 '20 at 00:36

sammywemmy

27,093
4
17
31

I have no idea what will be the columns name before importing them. To be more precise, every new dataframe contains about 100 columns, from which 90%-95% are common with the other dataframes. I edited my question to add these information. – Antoine T Feb 26 '20 at 00:51

Avoid duplicate columns while merging with pandas

2 Answers2