
I would like to merge nine Pandas dataframes together into a single dataframe, doing a join on two columns, controlling the column names. Is this possible?

I have nine datasets. All of them have the following columns:

org, name, items, spend

I want to join them into a single dataframe with the following columns:

org, name, items_df1, spend_df1, items_df2, spend_df2, items_df3...

I've been reading the documentation on merging and joining. I can currently merge two datasets together like this:

ad = pd.DataFrame.merge(df_presents, df_trees,
                        on=['org', 'name'],
                        suffixes=['_presents', '_trees'])

This works great; running `print(list(ad.columns.values))` shows me the following columns:

[u'org', u'name', u'spend_presents', u'items_presents', u'spend_trees', u'items_trees']

But how can I do this for nine dataframes? merge only seems to accept two at a time, and if I do it sequentially, my column names will end up very messy.

  • Just found this http://stackoverflow.com/questions/24853762/pandas-merging-multiple-dataframes but I'm not sure it works for my example - I guess I need to concatenate, then merge somehow? I want to write my output to a BigQuery table, so I don't know if hierarchical dataframes will work for me. – Richard Dec 17 '15 at 16:00
  • Found this http://stackoverflow.com/questions/23668427/pandas-joining-multiple-dataframes-on-columns?rq=1 which is more promising, but doesn't explain how to control the column names. – Richard Dec 17 '15 at 16:12

3 Answers


You could use functools.reduce to iteratively apply pd.merge to each of the DataFrames:

result = functools.reduce(merge, dfs)

This is equivalent to

result = dfs[0]
for df in dfs[1:]:
    result = merge(result, df)

To pass the on=['org', 'name'] argument, you could use functools.partial to define the merge function:

merge = functools.partial(pd.merge, on=['org', 'name'])

Since specifying the suffixes parameter in functools.partial would allow only one fixed choice of suffix, and since here we need a different suffix for each pd.merge call, I think it is easiest to prepare the DataFrames' column names before calling pd.merge:

for i, df in enumerate(dfs, start=1):
    df.rename(columns={col:'{}_df{}'.format(col, i) for col in ('items', 'spend')}, 
              inplace=True)

For example,

import pandas as pd
import numpy as np
import functools
np.random.seed(2015)

N = 50
# nine toy DataFrames, all with the same four columns
dfs = [pd.DataFrame(np.random.randint(5, size=(N, 4)),
                    columns=['org', 'name', 'items', 'spend']) for i in range(9)]
# give the value columns a per-frame suffix before merging
for i, df in enumerate(dfs, start=1):
    df.rename(columns={col: '{}_df{}'.format(col, i) for col in ('items', 'spend')},
              inplace=True)
merge = functools.partial(pd.merge, on=['org', 'name'])  # fix the join keys
result = functools.reduce(merge, dfs)  # pairwise inner merges, left to right
print(result.head())
print(result.head())

yields

   org  name  items_df1  spend_df1  items_df2  spend_df2  items_df3  \
0    2     4          4          2          3          0          1   
1    2     4          4          2          3          0          1   
2    2     4          4          2          3          0          1   
3    2     4          4          2          3          0          1   
4    2     4          4          2          3          0          1   

   spend_df3  items_df4  spend_df4  items_df5  spend_df5  items_df6  \
0          3          1          0          1          0          4   
1          3          1          0          1          0          4   
2          3          1          0          1          0          4   
3          3          1          0          1          0          4   
4          3          1          0          1          0          4   

   spend_df6  items_df7  spend_df7  items_df8  spend_df8  items_df9  spend_df9  
0          3          4          1          3          0          1          2  
1          3          4          1          3          0          0          3  
2          3          4          1          3          0          0          0  
3          3          3          1          3          0          1          2  
4          3          3          1          3          0          0          3  
unutbu
  • This is great! Thank you very much! It unfortunately is an inner join however, how would you deal with an outer join? – jeangelj Feb 06 '17 at 17:14
  • @jeangelj: You can add `how='outer'` to the `functools.partial` call: e.g. `merge = functools.partial(pd.merge, on=['org', 'name'], how='outer')`. – unutbu Feb 06 '17 at 21:53

Would doing a big pd.concat() and then renaming all the columns work for you? Something like:

desired_columns = ['items', 'spend']
big_df = pd.concat([df1, df2[desired_columns], ..., dfN[desired_columns]], axis=1)

# match the items-then-spend order that the concat above produces
new_columns = ['org', 'name']
for i in range(num_dataframes):
    new_columns.extend(['items_df%i' % i, 'spend_df%i' % i])

big_df.columns = new_columns

This should give you columns like:

org, name, items_df0, spend_df0, items_df1, spend_df1, ..., items_df8, spend_df8
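
For what it's worth, here is a runnable version of this sketch, assuming the nine frames live in a list dfs (as in the reduce answer above). One caveat: pd.concat(..., axis=1) aligns rows by index, not by matching org/name values, so this only works if all the frames are already row-aligned:

import pandas as pd

# assumed: dfs is a list of nine DataFrames, each with the columns
# org, name, items, spend, all sharing the same row index
value_cols = ['items', 'spend']

# keep org/name from the first frame; take only items/spend from the rest
big_df = pd.concat([dfs[0]] + [df[value_cols] for df in dfs[1:]], axis=1)

new_columns = ['org', 'name']
for i in range(len(dfs)):
    new_columns.extend(['items_df%i' % i, 'spend_df%i' % i])
big_df.columns = new_columns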

Zachary Cross

I've wanted this as well at times but been unable to find a built-in pandas way of doing it. Here is my suggestion (and my plan for the next time I need it):

  1. Create an empty dictionary, merge_dict.
  2. Loop through the index you want (here, the (org, name) pair) for each of your data frames and add the desired values to the dictionary with the index as the key.
  3. Generate a new index as sorted(merge_dict).
  4. Generate a new list of data for each column by looping through merge_dict.items().
  5. Create a new data frame with index=sorted(merge_dict) and columns created in the previous step.

Basically, this is somewhat like a hash join in SQL. It seems like the most efficient way I can think of, and it shouldn't take too long to code up; a rough sketch follows.
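
A minimal sketch of that plan, assuming the join key is the (org, name) pair and that it is unique within each frame (the hash_join name is mine; duplicate keys within a frame would overwrite one another):

import pandas as pd

def hash_join(dfs, keys=('org', 'name'), value_cols=('items', 'spend')):
    # Steps 1-2: key each row by its (org, name) tuple and collect the
    # suffixed values from every frame under that key.
    merge_dict = {}
    for i, df in enumerate(dfs, start=1):
        for row in df.itertuples(index=False):
            entry = merge_dict.setdefault(tuple(getattr(row, k) for k in keys), {})
            for col in value_cols:
                entry['{}_df{}'.format(col, i)] = getattr(row, col)
    # Step 3: the sorted keys become the new index.
    index = sorted(merge_dict)
    # Steps 4-5: one list of data per suffixed column; a key missing from
    # some frame yields None, so this behaves like an outer join.
    columns = ['{}_df{}'.format(col, i)
               for i in range(1, len(dfs) + 1) for col in value_cols]
    data = {c: [merge_dict[k].get(c) for k in index] for c in columns}
    return pd.DataFrame(data, columns=columns,
                        index=pd.MultiIndex.from_tuples(index, names=list(keys)))

Calling hash_join(dfs).reset_index() turns the key back into ordinary org and name columns.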

Good luck.

Cmdt.Ed