0

I have a list of dataframes I want to merge together with the following requirements:

  • If the dataframes have the same columns (i.e. if it's a data refresh of some sort), then they should be "concatenated" (no creating of suffixes such as _x or _y)
  • The merging should be done on a list of indices so that no duplicate rows will be found

Example:

df1 = pd.DataFrame{'WeekCom':['2020-01-01','2020-01-02','2020-01-02','2020-01-03'],'Y':[2020,2020,2020,2020],'QT':[Q1,Q1,Q1,Q1],'M':['Jan','Jan','Jan','Jan'],'W':['W1','W1','W1','W2'],'Col_X':[0,1,1,2],'Col_Y':[3,0,0,1]}


df2 = pd.DataFrame{'WeekCom':['2020-01-02','2020-01-04'],'Y':[2020,2020],'QT':[Q1,Q1],'M':['Jan','Jan'],'W':['W1','W2'],'Col_X':[1,3],'Col_Z':[3,3]}


dataframe_list = [df1,df2]

current_file_type_merged_dataframe = reduce(lambda merged, df: pd.merge(merged, df, on=list_of_indices, how='outer'), dataframe_list)

This kinda works on the list of indices ('WeekCom','Y','H','QT',...) since I get a unique set of indices on the dataframe, but then the columns that have the same name from both dataframes, are "separated" by the suffix "_x" and "_y". I've already checked this SO thread Pandas Merging 101, but no sucess.

Thanks!

Chronicles
  • 436
  • 1
  • 11
  • Just write a for-loop, with an if condition inside for the check whether there are new columns or if they're just the same, and then the necessary action. That takes care of the first condition. I.e., if you need to concatenate horizontally (merge), you do that, and if need to concatenate vertically (append) you do that. – 9769953 Aug 14 '23 at 14:42

1 Answers1

-1

To achieve the desired behavior where dataframes are concatenated without creating suffixes for columns with the same name and the merging is done on a list of indices to avoid duplicate rows, you can use the pd.concat function followed by a groupby operation. Here's an example of how to do this:

import pandas as pd

# Sample dataframes
df1 = pd.DataFrame({
    'WeekCom': ['2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03'],
    'Y': [2020, 2020, 2020, 2020],
    'QT': ['Q1', 'Q1', 'Q1', 'Q1'],
    'M': ['Jan', 'Jan', 'Jan', 'Jan'],
    'W': ['W1', 'W1', 'W1', 'W2'],
    'Col_X': [0, 1, 1, 2],
    'Col_Y': [3, 0, 0, 1]
})

df2 = pd.DataFrame({
    'WeekCom': ['2020-01-02', '2020-01-04'],
    'Y': [2020, 2020],
    'QT': ['Q1', 'Q1'],
    'M': ['Jan', 'Jan'],
    'W': ['W1', 'W2'],
    'Col_X': [1, 3],
    'Col_Z': [3, 3]
})

# List of dataframes
dataframe_list = [df1, df2]

# Concatenate dataframes
concatenated_df = pd.concat(dataframe_list, ignore_index=True)

# Group by the list of indices and aggregate by selecting the first value (no suffixes)
result_df = concatenated_df.groupby(['WeekCom', 'Y', 'QT', 'M', 'W'], as_index=False).first()

# Output the result
print(result_df)

This code first concatenates the dataframes in dataframe_list using pd.concat and then groups the concatenated dataframe by the list of indices (in this case, ['WeekCom', 'Y', 'QT', 'M', 'W']) and aggregates by selecting the first value in each group. This ensures that duplicate rows are removed while columns with the same name are not duplicated with suffixes. The final result is a merged dataframe based on the specified indices.

moe_
  • 219
  • 2
  • 3
  • 15