Iterrows merge in Pandas results in duplicated columns with suffixes

Question

I have a dataframe of event occurrences in which the date is formatted as 'YYYY-WW'. Various events can take place, some at the same time as other events in the same timeframe. Example datframe as follows;

df1 = pd.DataFrame({'yyyyww': ['2022-01','2022-02','2022-03', '2022-01','2022-02','2022-03','2022-01','2022-03'],
                         'event': ['event1','event1','event1','event2','event2','event3','event4','event4'],
                         'event_flag': [1,1,1,1,1,1,1,1,]})

I have a 2nd dataframe to which I want to left join the 1st dataframe. The 2nd dataframe could potentially contain many more dates than is featured in df1 but the for the purposes of this question is as follows:

df2 = pd.DataFrame({'yyyyww': ['2022-01','2022-02','2022-03'],
                         'col1': ['apple','car','banana']})

Ultimately, I want to perform a left join such that the values from event in df1 become additional column headers in df2, with the event_flag from df1 becoming a boolean value under the respective column header, as follows:

desired_outcome = pd.DataFrame({'yyyyww': ['2022-01','2022-02','2022-03'],
                         'col1': ['apple','car','banana'],
                         'event1':[1,1,1],
                         'event2':[1,1,0],
                         'event3':[0,0,1],
                         'event4':[1,0,1],
                         })

However, when using iterrows() to achieve this, what I end up with is something that bears some resemblance to the desired outcome but duplicates the columns such that I end up with multiple columns with suffixes, as follows:

for index, row in df1.iterrows():
    index_value = row['event']
    #column_a_value = row['disco']
    yyyyww = row['yyyyww']
    event_flag = row['event_flag']
    df2 = df2.merge(pd.DataFrame({'yyyyww': [yyyyww],
                                                        f'{index_value}': [event_flag]
                                                        }),
                                           left_on='yyyyww', right_on='yyyyww', how='left')
                                        

df2.fillna(0)

How can I perform the required operation without resulting in duplicated columns?

score 2 · Accepted Answer · answered Aug 15 '23 at 20:14

2

After .merge you can .pivot the dataframe to obtain the final form:

out = (
    df2.merge(df1, on="yyyyww")
    .pivot(index=["yyyyww", "col1"], columns="event", values="event_flag")
    .fillna(0)
    .astype(int)
    .reset_index()
    .rename_axis(columns=None, index=None)
)
print(out)

Prints:

    yyyyww    col1  event1  event2  event3  event4
0  2022-01   apple       1       1       0       1
1  2022-02     car       1       1       0       0
2  2022-03  banana       1       0       1       1

answered Aug 15 '23 at 20:14

Andrej Kesely

168,389
15
48
91

Ok this seems to work. Are you able to explain how? – jimiclapton Aug 15 '23 at 20:16
1

@jimiclapton It works because with pivoting you select one column as headers of the new dataframe (`event` in this case) and column where values are from (`event_flag`) - for the missing values pandas will put `NaN`, so one needs to replace them with `0`. More here: https://stackoverflow.com/questions/28337117/how-to-pivot-a-dataframe-in-pandas – Andrej Kesely Aug 15 '23 at 20:18

score 0 · Answer 2 · answered Aug 15 '23 at 20:46

df=df1.merge(df2,on='yyyyww')
dummies=pd.get_dummies(df['event'])
df=df.join(dummies)
df=df.drop(['event','event_flag'],axis=1)
gp=df.groupby(['yyyyww','col1'])
gp.sum(numeric_only=True).reset_index()

Broadly speaking, the use of iterrows, especially for setting things, can be a bit of a footgun (not to mention slow), so I personally try to avoid it where possible.

Iterrows merge in Pandas results in duplicated columns with suffixes

2 Answers2