
I have several dataframes df1, df2, ... with duplicate data and partly overlapping columns and rows (see below).

How can I lump all dataframes into one dataframe?

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1,2], 'B': [4,5]}, index=['a', 'b'])

df2 = pd.DataFrame({'B': [5,6], 'C': [8,9]}, index=['b', 'c'])

df3 = pd.DataFrame({'A': [2,3], 'B': [5,6]}, index=['b', 'c'])

df4 = pd.DataFrame({'C': [7,8]}, index=['a', 'b'])

df5 = pd.DataFrame({'A': [1], 'B': [4], 'C': [7]}, index=['a'])

....
Added: the expected data structure:
   A   B   C
a  1   4   7
b  2   5   8
c  3   6   9
Added: what I am really looking for is a more efficient version of the following script, which is really slow for big dataframes:
dfs =[df1, df2, df3, df4, df5]
cols, rows = [], []
for df in dfs:
    cols = cols + df.columns.tolist()
    rows = rows + df.index.tolist()
cols = np.unique(cols)
rows = np.unique(rows)

merged_dfs = pd.DataFrame(data=np.nan, columns=cols, index=rows)

for df in dfs:
    for col in df.columns:
        for row in df.index:
            merged_dfs.loc[row, col] = df.loc[row, col]
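The per-cell double loop is what makes this slow; pandas can do the same aligned assignment with one `DataFrame.update` call per frame, since `update` aligns on index and columns by itself. A sketch, assuming the df1..df5 defined above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [4, 5]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [5, 6], 'C': [8, 9]}, index=['b', 'c'])
df3 = pd.DataFrame({'A': [2, 3], 'B': [5, 6]}, index=['b', 'c'])
df4 = pd.DataFrame({'C': [7, 8]}, index=['a', 'b'])
df5 = pd.DataFrame({'A': [1], 'B': [4], 'C': [7]}, index=['a'])
dfs = [df1, df2, df3, df4, df5]

# collect the union of all labels once
cols = sorted(set().union(*(df.columns for df in dfs)))
rows = sorted(set().union(*(df.index for df in dfs)))
merged_dfs = pd.DataFrame(np.nan, index=rows, columns=cols)

# update() aligns on index and columns, replacing the per-cell loop
for df in dfs:
    merged_dfs.update(df)
```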

Fast and easy solution (added 23 Dec 2015):

dfs =[df1, df2, df3, df4, df5]

# create empty DataFrame with all cols and rows
cols, rows = [], []
for df_i in dfs:
    cols = cols + df_i.columns.tolist()
    rows = rows + df_i.index.tolist()
cols = np.unique(cols)
rows = np.unique(rows)
df = pd.DataFrame(data=np.nan, columns=cols, index=rows)

# fill DataFrame
for df_i in dfs:
    df.loc[df_i.index, df_i.columns] = df_i.values
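Assuming plain pandas, the two loops above can also be collapsed into a single `concat`/`groupby` chain: stack all frames and keep the first non-null value per cell. A sketch (not benchmarked here):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [4, 5]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [5, 6], 'C': [8, 9]}, index=['b', 'c'])
df3 = pd.DataFrame({'A': [2, 3], 'B': [5, 6]}, index=['b', 'c'])
df4 = pd.DataFrame({'C': [7, 8]}, index=['a', 'b'])
df5 = pd.DataFrame({'A': [1], 'B': [4], 'C': [7]}, index=['a'])
dfs = [df1, df2, df3, df4, df5]

# stack all rows, then collapse duplicate index labels,
# taking the first non-null value in each column
df = pd.concat(dfs).groupby(level=0).first()
print(df)
```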
user3368210

2 Answers


With index preservation

This is an updated version that preserves the index:

from functools import reduce

dfs = [df1, df2, df3, df4, df5]

def my_merge(df1, df2):
    res = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
    cols = sorted(res.columns)
    pairs = []
    for col1, col2 in zip(cols[:-1], cols[1:]):
        if col1.endswith('_x') and col2.endswith('_y'):
            pairs.append((col1, col2))
    for col1, col2 in pairs:
        res[col1[:-2]] = res[col1].combine_first(res[col2])
        res = res.drop([col1, col2], axis=1)
    return res

print(reduce(my_merge, dfs))

Output:

   A  B  C
a  1  4  7
b  2  5  8
c  3  6  9
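The suffix bookkeeping in my_merge can be avoided entirely: `combine_first` already aligns on both index and columns and fills missing cells from the other frame, so the same result (assuming the question's df1..df5) falls out of a plain reduce:

```python
import pandas as pd
from functools import reduce

df1 = pd.DataFrame({'A': [1, 2], 'B': [4, 5]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [5, 6], 'C': [8, 9]}, index=['b', 'c'])
df3 = pd.DataFrame({'A': [2, 3], 'B': [5, 6]}, index=['b', 'c'])
df4 = pd.DataFrame({'C': [7, 8]}, index=['a', 'b'])
df5 = pd.DataFrame({'A': [1], 'B': [4], 'C': [7]}, index=['a'])
dfs = [df1, df2, df3, df4, df5]

# each step keeps existing values and fills NaN from the next frame
merged = reduce(lambda left, right: left.combine_first(right), dfs)
print(merged)
```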

Without index preservation

This would be one way:

from functools import reduce  # Python 3 only

dfs = [df1, df2, df3, df4, df5]

def my_merge(df1, df2):
    return pd.merge(df1, df2, how='outer')

merged_dfs = reduce(my_merge, dfs)

Results in:

     A    B    C
0    1    4  NaN
1    2    5    8
2  NaN    6    9
3    3    6  NaN
4  NaN  NaN    7
5    1    4    7

You can adapt the join method by setting how:

how : {'left', 'right', 'outer', 'inner'}, default 'inner'

  • left: use only keys from left frame (SQL: left outer join)
  • right: use only keys from right frame (SQL: right outer join)
  • outer: use union of keys from both frames (SQL: full outer join)
  • inner: use intersection of keys from both frames (SQL: inner join)
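For example, df1 and df2 above share only column B, so the choice of how decides which B values survive. A quick sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [4, 5]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [5, 6], 'C': [8, 9]}, index=['b', 'c'])

inner = pd.merge(df1, df2, how='inner')  # only B == 5 occurs in both frames
outer = pd.merge(df1, df2, how='outer')  # union of B values, NaN elsewhere
print(inner)
print(outer)
```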

If you like lambda, use this version for the same result:

reduce(lambda df1, df2: pd.merge(df1, df2, how='outer'), dfs)
Mike Müller

Same idea as the other answer, but slightly different function:

from functools import reduce

def multiple_merge(lst_dfs, on):
    reduce_func = lambda left, right: pd.merge(left, right, on=on)
    return reduce(reduce_func, lst_dfs)

Here, lst_dfs is a list of dataframes, and on is the column (or list of columns) to join on.
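For instance, with hypothetical frames that share only an id key column (note the default how='inner', so only ids present in every frame survive):

```python
import pandas as pd
from functools import reduce

def multiple_merge(lst_dfs, on):
    reduce_func = lambda left, right: pd.merge(left, right, on=on)
    return reduce(reduce_func, lst_dfs)

# hypothetical example data
names  = pd.DataFrame({'id': [1, 2, 3], 'name': ['ann', 'bob', 'cal']})
ages   = pd.DataFrame({'id': [1, 2, 3], 'age': [30, 40, 50]})
cities = pd.DataFrame({'id': [2, 3], 'city': ['oslo', 'rome']})

result = multiple_merge([names, ages, cities], on='id')
print(result)
```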

dslack