
What is the most efficient way to merge multiple data frames (i.e., more than 2) in pandas? There are a few answers:

  1. pandas joining multiple dataframes on columns
  2. Pandas left outer join multiple dataframes on multiple columns

but these all involve multiple joins. If I have N data frames, that would require N-1 joins.

If I weren't using pandas, another solution would be to just put everything into a hash table keyed on the common index and build the final version from it. This is basically like a hash join in SQL, I believe. Is there something like that in pandas?

If not, would it be more efficient to just create a new data frame with the common index and pass it the raw data from each data frame? It seems like that would at least avoid creating a new data frame in each of the N-1 joins. A rough sketch of the hash-table idea is below.
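
For illustration, here is a rough sketch of the hash-table idea in plain Python (hash_merge and the per-row dict handling are my own illustration, not an existing pandas API):

import pandas as pd

# Sketch only: merge N frames by accumulating rows in one dict keyed on the
# shared index, then building the result in a single pass at the end.
def hash_merge(frames):
    rows = {}
    for df in frames:
        for idx, row in df.iterrows():
            # First frame to see this index creates the row; later frames
            # extend it with their own columns.
            rows.setdefault(idx, {}).update(row.to_dict())
    return pd.DataFrame.from_dict(rows, orient="index")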

Thanks.


1 Answer


If you can join your data frames by index, you can do it in one chained call:

df1.join(df2).join(df3).join(df4)

Example:

In [187]: df1
Out[187]:
   a  b
0  5  2
1  6  7
2  6  5
3  1  6
4  0  2

In [188]: df2
Out[188]:
   c  d
0  5  7
1  5  5
2  2  4
3  4  3
4  9  0

In [189]: df3
Out[189]:
   e  f
0  8  1
1  0  9
2  4  5
3  3  9
4  9  5

In [190]: df1.join(df2).join(df3)
Out[190]:
   a  b  c  d  e  f
0  5  2  5  7  8  1
1  6  7  5  5  0  9
2  6  5  2  4  4  5
3  1  6  4  3  3  9
4  0  2  9  0  9  5

This should be pretty fast and efficient.
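
For an arbitrary list of frames, the same chained join can be written with functools.reduce (a small sketch; dfs is an assumed list of index-aligned DataFrames with distinct column names):

from functools import reduce

# Chains the pairwise joins left to right: ((df1.join(df2)).join(df3))...
merged = reduce(lambda left, right: left.join(right), dfs)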

Alternatively, you can concatenate them:

In [191]: pd.concat([df1,df2,df3], axis=1)
Out[191]:
   a  b  c  d  e  f
0  5  2  5  7  8  1
1  6  7  5  5  0  9
2  6  5  2  4  4  5
3  1  6  4  3  3  9
4  0  2  9  0  9  5

Timing comparison for 3 DataFrames with 100K rows each:

In [198]: %timeit pd.concat([df1,df2,df3], axis=1)
100 loops, best of 3: 5.67 ms per loop

In [199]: %timeit df1.join(df2).join(df3)
100 loops, best of 3: 3.93 ms per loop

As you can see, join is a bit faster.
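
For reference, a minimal setup to reproduce a comparison like this (the shapes and random data are my assumptions; absolute timings will vary by machine and pandas version):

import numpy as np
import pandas as pd
import timeit

# Three 100K-row frames with distinct column names and a shared default index.
df1 = pd.DataFrame(np.random.randint(0, 10, (100_000, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.random.randint(0, 10, (100_000, 2)), columns=['c', 'd'])
df3 = pd.DataFrame(np.random.randint(0, 10, (100_000, 2)), columns=['e', 'f'])

# Time 100 runs of each approach.
print(timeit.timeit(lambda: pd.concat([df1, df2, df3], axis=1), number=100))
print(timeit.timeit(lambda: df1.join(df2).join(df3), number=100))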
