0

I have a situation where I need to merge multiple dataframes that I can do easily using the below code:

# Merge all the datasets together
df_prep1 = df_prep.merge(df1,on='e_id',how='left')
df_prep2 = df_prep1.merge(df2,on='e_id',how='left')
df_prep3 = df_prep2.merge(df3,on='e_id',how='left')
df_prep4 = df_prep3.merge(df_4,on='e_id',how='left')
df_prep5 = df_prep4.merge(df_5,on='e_id',how='left')
df_prep6 = df_prep5.merge(df_6,on='e_id',how='left')

But what I want to understand is that if there is any other efficient way to perform this merge, maybe using a helper function? If yes, then how could I achieve that?

Django0602
  • 797
  • 7
  • 26
  • 1
    It depends on your dataframes... Same columns or not? Without a sample of each of them, it's hard to answer. – Corralien Jul 13 '21 at 21:58
  • What I am trying to ask here is that if there is an efficient way to write a helper function and pass all these data merge statements into it. I believe that it doesn't matter as to what all cols are getting merged as long as I have the final output getting created correctly. Or am I missing something here? – Django0602 Jul 13 '21 at 22:04
  • 1
    Always avoid similarly structured objects (like number-suffixed sets) flooding global environment. Use `list` or `dict` containers for management and organization. – Parfait Jul 14 '21 at 00:15

2 Answers2

0

You can use reduce from functools module to merge multiple dataframes:

from functools import reduce

dfs = [df_1, df_2, df_3, df_4, df_5, df_6]

out = reduce(lambda dfl, dfr: pd.merge(dfl, dfr, on='e_id', how='left'), dfs)
Corralien
  • 109,409
  • 8
  • 28
  • 52
0

You can put all your dfs into a list, or pass them from a function, a loop, etc. and then have 1 main df that you merge everything onto.

You can start with an empty df and iterate through. In your case, since you are doing left merge, it looks like your df_prep should already have all of the e_id values that you want. You'll need to figure out what you want to do with any additional columns, e.g., you can have pandas add _x and _y after conflicting column names that you don't merge, or rename them, etc. See this toy example:

main_df = pd.DataFrame({'e_id': [0, 1, 2, 3, 4]})

for x in range(3):
    dfx = pd.DataFrame({'e_id': [x], 'another_col' + str(x): [x * 10]})
    main_df = main_df.merge(dfx, on='e_id', how='left')

to get:

   e_id  another_col0  another_col1  another_col2
0     0           0.0           NaN           NaN
1     1           NaN          10.0           NaN
2     2           NaN           NaN          20.0
3     3           NaN           NaN           NaN
4     4           NaN           NaN           NaN
Hammurabi
  • 1,141
  • 1
  • 4
  • 7
  • Avoid growing objects in a loop with methods like `merge` and `concat`. This can lead to [quadratic copy](https://stackoverflow.com/a/36489724/1422451). – Parfait Jul 14 '21 at 00:13