Right now I have many different statistics with names attached, all in separate dataframes. To merge them, do I have to keep writing the result out to a new dataframe each time? Is there a more efficient way to do this?

Does pd.merge make it easier if the columns being merged on have the same name in both dataframes?

Do I have to repeatedly write

pd.merge(left=something, right=somethingelse, left_on='name', right_on='site')
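
In other words, is chaining merges one at a time the only option? Something like this (df_a, df_b, and df_c stand in for my separate statistics dataframes):

import pandas as pd

# writing each intermediate result back out to a new dataframe
merged = pd.merge(left=df_a, right=df_b, left_on='name', right_on='site')
merged = pd.merge(left=merged, right=df_c, left_on='name', right_on='site')

# if both sides used the same column name, a single on= would do instead:
# merged = pd.merge(df_a, df_b, on='name')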
William Bernard

3 Answers


You can do it by first creating a list of all your dataframes and then getting the result with functools.reduce:

import numpy as np
import pandas as pd
from functools import reduce

# create some data
columns = ['v1', 'v2', 'v3']
df1 = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=columns)
df2 = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=columns)
df3 = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=columns)

dfs = [df1, df2, df3]  # store all the dataframes in one list

# merge them pairwise from left to right on 'v1'
df_merge = reduce(lambda left, right: pd.merge(left, right, on=['v1'], how='outer'), dfs)
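
With random data the values will differ per run, but you can inspect the merged frame to see that the three tables were combined on v1 (the overlapping v2/v3 columns from the first merge get _x/_y suffixes):

# quick check of the merged result
print(df_merge.shape)
print(df_merge.columns.tolist())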
Clock Slave

If the column(s) that you wish to merge on have unique values, then there is a much faster way: set the column(s) you wish to merge on as the index and then concatenate all the dataframes with pd.concat:

import itertools as IT
import numpy as np
import pandas as pd
import functools

# build M dataframes of N rows; each has P data columns with globally unique
# names (drawn from a shared counter) plus a 'foo' key column whose values
# are a permutation of range(N), so the merge key is unique in every frame
count = IT.count()
M, N, P = 100, 10, 4
dfs = [pd.DataFrame({
    next(count): np.random.randint(4, size=(N)) for j in range(P)}) for i in range(M)]

for i in range(M):
    dfs[i]['foo'] = np.random.choice(N, size=N, replace=False)

def using_merge(dfs):
    # merge pairwise in a loop, materialising a new intermediate result each time
    result = dfs[0]
    for df in dfs[1:]:
        result = pd.merge(result, df, on='foo')
    return result

def using_reduce(dfs):
    return functools.reduce(
        lambda left, right: pd.merge(left, right, on=['foo']), dfs)

def using_concat(dfs):
    # set the unique key as the index and concatenate column-wise in one pass
    return pd.concat([df.set_index('foo') for df in dfs], axis=1)

The problem with calling merge in a loop is that each call returns an intermediate DataFrame, which requires copying values from both the left and right DataFrames. Because the intermediate result grows on every iteration, the total amount of copying grows quadratically with the number of DataFrames.

When the index is unique, pd.concat can be used to avoid the quadratic copying -- the values from all the DataFrames in dfs only need to be copied once into the result.


Here is a microbenchmark for the example above.

In [160]: %timeit using_concat(dfs)
10 loops, best of 3: 81.2 ms per loop

In [161]: %timeit using_merge(dfs)
1 loop, best of 3: 660 ms per loop

In [162]: %timeit using_reduce(dfs)
1 loop, best of 3: 659 ms per loop

The speed advantage is variable -- it depends on the number of DataFrames, M, to be merged. As M increases, so will the speed advantage of using_concat over using_merge or using_reduce. But let me emphasize again that pd.concat can be used as a substitute for pd.merge only when the column values are unique -- that is, the merge is 1-to-1, not many-to-one or one-to-many.

If the columns to be merged on do not have unique values for each DataFrame, then I don't see a faster way to compute the desired result than to call pd.merge in a loop.
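
If you want to convince yourself that the two approaches agree on this data (a quick sanity check, not part of the benchmark), you can align both results on foo and compare:

# both results hold the same values once rows are aligned on the unique
# 'foo' key and the columns are put in the same order
a = using_concat(dfs).sort_index()
b = using_merge(dfs).set_index('foo').sort_index()
assert a.sort_index(axis=1).equals(b.sort_index(axis=1))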

unutbu

First, we define how to merge two dataframes:

def merge_two(a, b, col):
    # b can be None when the list has an odd length (see zip_longest below)
    if b is None:
        return a
    return pd.merge(a, b, on=col)

Next, we merge the dataframes in pairs, halving the list on each pass:

from multiprocessing import Pool
from itertools import zip_longest, islice  # izip_longest is Python 2; use zip_longest on Python 3
from functools import partial

def merge_many(dfs, col):
    p = Pool(8)  # number of cores
    merge = partial(merge_two, col=col)
    while len(dfs) > 1:
        # pair up consecutive dataframes; an odd trailing dataframe is paired with None
        pairs = zip_longest(islice(dfs, 0, None, 2), islice(dfs, 1, None, 2))
        dfs = p.starmap(merge, pairs)  # starmap unpacks each (left, right) pair
    return dfs[0]

Since the pairwise merges are independent of each other, there's no harm in using the multiprocessing module to run them in parallel.
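
A usage sketch (dfs is a placeholder list of dataframes that all share a 'name' key column); the __main__ guard matters because multiprocessing spawns worker processes that re-import the module:

if __name__ == '__main__':
    # dfs and the 'name' key column are placeholders for your own data
    merged = merge_many(dfs, 'name')
    print(merged.shape)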

Uri Goren