
I have been trying to merge multiple dataframes using the reduce() function, as mentioned in this link: pandas three-way joining multiple dataframes on columns.

dfs = [df0, df1, df2, ..., dfN]
df_final = reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)
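For reference, here is a minimal runnable sketch of that pattern (dataframes and values invented for illustration):

```python
from functools import reduce
import pandas as pd

# three toy dataframes that all share a 'name' column
df0 = pd.DataFrame({'name': ['a', 'b'], 'x': [1, 2]})
df1 = pd.DataFrame({'name': ['a', 'b'], 'y': [3, 4]})
df2 = pd.DataFrame({'name': ['a', 'b'], 'z': [5, 6]})

dfs = [df0, df1, df2]
# fold the list into a single dataframe, merging pairwise on 'name'
df_final = reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)
```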

However, in my case the join columns differ between the related dataframes, so I would need different left_on and right_on values for every merge.

I have come up with a workaround, which is neither efficient nor elegant, but for now it works. I would like to know whether the same can be achieved using reduce(), or maybe through other, more efficient alternatives. I foresee that there will be many dataframes I need to join down the line.

import pandas as pd

...
...

# xml files - table1.xml, table2.xml and table3.xml are converted to <dataframe1>, <dataframe2>, <dataframe3> respectively.
_df = {
    'table1': '<dataframe1>',
    'table2': '<dataframe2>',
    'table3': '<dataframe3>'
}

# variable that tells which column of table1 is related to which column of table2; used as left_on/right_on while merging dataframes
_relationship = {
    'table1': {
        'table2': ['NAME', 'DIFF_NAME']},
    'table2': {
        'table3': ['T2_ID', 'T3_ID']}
}

def _join_dataframes(_rel_pair):
    # shallow copy so the original dict of dataframes is left untouched
    df_temp = dict(_df)

    for first_table, second_table in _rel_pair:
        # look up the left/right join columns for this pair of tables
        lefton = _relationship[first_table][second_table][0]
        righton = _relationship[first_table][second_table][1]

        _merged_df = pd.merge(df_temp[first_table], df_temp[second_table],
                              left_on=lefton, right_on=righton, how="inner")
        # store the running result so the next pair merges onto it
        df_temp[second_table] = _merged_df

    return _merged_df

# I have come up with this structure based on _df.keys()
_rel_pair = [['table1', 'table2'], ['table2', 'table3']]

_join_dataframes(_rel_pair)
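For illustration, the loop above can also be expressed as a reduce() fold over the pair list; here is a toy, runnable sketch (all dataframes and values invented for illustration):

```python
from functools import reduce
import pandas as pd

# toy stand-ins for the dataframes converted from the xml files
_df = {
    'table1': pd.DataFrame({'NAME': ['a'], 'v1': [1]}),
    'table2': pd.DataFrame({'DIFF_NAME': ['a'], 'T2_ID': [10], 'v2': [2]}),
    'table3': pd.DataFrame({'T3_ID': [10], 'v3': [3]}),
}

_relationship = {
    'table1': {'table2': ['NAME', 'DIFF_NAME']},
    'table2': {'table3': ['T2_ID', 'T3_ID']},
}

_rel_pair = [['table1', 'table2'], ['table2', 'table3']]

def _merge_pair(acc, pair):
    first, second = pair
    lefton, righton = _relationship[first][second]
    return pd.merge(acc, _df[second], left_on=lefton, right_on=righton, how='inner')

# seed the fold with the first table, then merge each pair's right-hand table in
result = reduce(_merge_pair, _rel_pair, _df[_rel_pair[0][0]])
```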
Pro

2 Answers


Why don't you just rename the columns of all the dataframes first?

df0.rename(columns={'old_column_name0': 'commonname'}, inplace=True)
          .
          .
          .
dfN.rename(columns={'old_column_nameN': 'commonname'}, inplace=True)


dfs = [df0, df1, df2, ..., dfN]
df_final = reduce(lambda left, right: pd.merge(left, right, on='commonname'), dfs)
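As a runnable sketch of that idea with two toy frames (all names invented for illustration):

```python
from functools import reduce
import pandas as pd

# toy frames whose join columns start out with different names
df0 = pd.DataFrame({'name': ['a', 'b'], 'x': [1, 2]})
df1 = pd.DataFrame({'site_name': ['a', 'b'], 'y': [3, 4]})

# rename maps old -> new, so both frames end up with a 'commonname' column
df0 = df0.rename(columns={'name': 'commonname'})
df1 = df1.rename(columns={'site_name': 'commonname'})

df_final = reduce(lambda left, right: pd.merge(left, right, on='commonname'), [df0, df1])
```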
  • I think renaming would not help because there is no common column for all the tables. For example, table1 has a "name" column which is related to the "site_name" column in table2, whereas "tx_id" of table2 is related to "txz_id" of table3. – Pro May 27 '20 at 21:19

Try using the concat function instead of reduce.

A simple trick I like to use when merging DFs is setting the index on the columns I want to use as a guide when merging. Example:

# note different column names 'B' and 'C'
dfA = pd.read_csv('yourfile_A.csv', index_col=['A', 'B'])
dfB = pd.read_csv('yourfile_B.csv', index_col=['C', 'D'])

df = pd.concat([dfA, dfB], axis=1)

You will need unique indexes/multi-indexes for this to work, but that should be no problem in most cases. I have never tried a large concat, but this approach should theoretically scale to N concats.
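A self-contained version of the same trick, building the frames in memory instead of reading CSVs (column names and values invented for illustration):

```python
import pandas as pd

# toy frames; the key columns have different names ('A', 'B' vs 'C', 'D')
dfA = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y'], 'valA': [10, 20]}).set_index(['A', 'B'])
dfB = pd.DataFrame({'C': [1, 2], 'D': ['x', 'y'], 'valB': [30, 40]}).set_index(['C', 'D'])

# axis=1 concatenation aligns rows on the index values, not the index names
df = pd.concat([dfA, dfB], axis=1)
```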

Alternatively, you can use merge instead, as it provides left_on and right_on parameters specifically for those situations where column names differ between dataframes. An example:

dfA.merge(dfB, left_on='name', right_on='username')
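For instance, with two invented frames:

```python
import pandas as pd

# the join keys live under different names in each frame
dfA = pd.DataFrame({'name': ['a', 'b'], 'x': [1, 2]})
dfB = pd.DataFrame({'username': ['a', 'b'], 'y': [3, 4]})

# rows pair up where dfA.name == dfB.username; both key columns are kept
merged = dfA.merge(dfB, left_on='name', right_on='username')
```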

A more complete explanation of how to merge dfs: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

concat: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

merge: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html

Caio Castro