If the column(s) that you wish to merge on have unique values, then there is a much faster way: set the column(s) you wish to merge on as the index and then concatenate all the dataframes with pd.concat:
import itertools as IT
import functools
import numpy as np
import pandas as pd

count = IT.count()
M, N, P = 100, 10, 4

# Build M DataFrames of N rows and P integer columns; the itertools counter
# gives every column across all the DataFrames a unique name.
dfs = [pd.DataFrame({next(count): np.random.randint(4, size=N) for j in range(P)})
       for i in range(M)]

# Give each DataFrame a 'foo' column whose values are a permutation of
# range(N), so the merge key is unique within each DataFrame.
for i in range(M):
    dfs[i]['foo'] = np.random.choice(N, size=N, replace=False)

def using_merge(dfs):
    result = dfs[0]
    for df in dfs[1:]:
        result = pd.merge(result, df, on='foo')
    return result

def using_reduce(dfs):
    return functools.reduce(
        lambda left, right: pd.merge(left, right, on=['foo']), dfs)

def using_concat(dfs):
    return pd.concat([df.set_index('foo') for df in dfs], axis=1)
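As a quick sanity check (a rough sketch, not part of the benchmark), the two approaches should agree once the merge result is also indexed by foo and both results are sorted by the index:

# Align both results on 'foo' and compare values, columns and dtypes.
merged = using_merge(dfs).set_index('foo').sort_index()
concatenated = using_concat(dfs).sort_index()
pd.testing.assert_frame_equal(merged, concatenated)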
The problem with calling merge in a loop is that it returns an intermediate DataFrame which requires copying values from the left and right DataFrames. When done in a loop, this leads to quadratically increasing amounts of copying. When the index is unique, pd.concat can be used to avoid the quadratic copying -- the values from all the DataFrames in dfs only need to be copied once into the result.
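To see where the quadratic behaviour comes from, here is a rough sketch (not part of the original timings) that counts how many values get rebuilt over the course of the loop:

# Each pd.merge call builds a brand-new intermediate DataFrame, so the whole
# accumulated result is copied again on every iteration. With M DataFrames of
# P columns and N rows each, roughly sum(k*P*N for k in range(1, M+1)),
# i.e. O(M**2) values, are copied in total.
result = dfs[0]
values_copied = result.size
for df in dfs[1:]:
    result = pd.merge(result, df, on='foo')
    values_copied += result.size  # the full intermediate is rebuilt each time
print(values_copied)              # grows quadratically as M increases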
Here is a microbenchmark for the example above.
In [160]: %timeit using_concat(dfs)
10 loops, best of 3: 81.2 ms per loop
In [161]: %timeit using_merge(dfs)
1 loop, best of 3: 660 ms per loop
In [162]: %timeit using_reduce(dfs)
1 loop, best of 3: 659 ms per loop
The speed advantage is variable -- it depends on the number of DataFrames, M, to be merged. As M increases, so will the speed advantage of using_concat over using_merge or using_reduce. But let me emphasize again that pd.concat can be used as a substitute for pd.merge only when the column values are unique -- that is, the merge is 1-to-1, not many-to-one or one-to-many.
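To illustrate the caveat with a toy example (not part of the original answer): when the key column contains duplicates, merge pairs up every matching row, so the number of rows can grow, and a column-wise concat cannot reproduce that:

left = pd.DataFrame({'foo': [1, 1, 2], 'a': [10, 20, 30]})
right = pd.DataFrame({'foo': [1, 1, 2], 'b': [100, 200, 300]})
print(pd.merge(left, right, on='foo'))
# The two rows with foo == 1 pair up every way (2 x 2 = 4 rows), plus the one
# row with foo == 2, giving 5 rows in total. Setting 'foo' as the index and
# concatenating along axis=1 cannot reproduce this row-multiplying behaviour.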
If the columns to be merged on do not have unique values for each DataFrame, then
I don't see a faster way to compute the desired result than to call pd.merge
in a loop.