2

I have a dataframe of event occurrences in which the date is formatted as 'YYYY-WW'. Various events can take place, some at the same time as other events in the same timeframe. Example datframe as follows;

df1 = pd.DataFrame({'yyyyww': ['2022-01','2022-02','2022-03', '2022-01','2022-02','2022-03','2022-01','2022-03'],
                         'event': ['event1','event1','event1','event2','event2','event3','event4','event4'],
                         'event_flag': [1,1,1,1,1,1,1,1,]})

enter image description here

I have a 2nd dataframe to which I want to left join the 1st dataframe. The 2nd dataframe could potentially contain many more dates than is featured in df1 but the for the purposes of this question is as follows:

df2 = pd.DataFrame({'yyyyww': ['2022-01','2022-02','2022-03'],
                         'col1': ['apple','car','banana']})

enter image description here

Ultimately, I want to perform a left join such that the values from event in df1 become additional column headers in df2, with the event_flag from df1 becoming a boolean value under the respective column header, as follows:

desired_outcome = pd.DataFrame({'yyyyww': ['2022-01','2022-02','2022-03'],
                         'col1': ['apple','car','banana'],
                         'event1':[1,1,1],
                         'event2':[1,1,0],
                         'event3':[0,0,1],
                         'event4':[1,0,1],
                         })

enter image description here

However, when using iterrows() to achieve this, what I end up with is something that bears some resemblance to the desired outcome but duplicates the columns such that I end up with multiple columns with suffixes, as follows:

for index, row in df1.iterrows():
    index_value = row['event']
    #column_a_value = row['disco']
    yyyyww = row['yyyyww']
    event_flag = row['event_flag']
    df2 = df2.merge(pd.DataFrame({'yyyyww': [yyyyww],
                                                        f'{index_value}': [event_flag]
                                                        }),
                                           left_on='yyyyww', right_on='yyyyww', how='left')
                                        

df2.fillna(0)

enter image description here

How can I perform the required operation without resulting in duplicated columns?

jimiclapton
  • 775
  • 3
  • 14
  • 42

2 Answers2

2

After .merge you can .pivot the dataframe to obtain the final form:

out = (
    df2.merge(df1, on="yyyyww")
    .pivot(index=["yyyyww", "col1"], columns="event", values="event_flag")
    .fillna(0)
    .astype(int)
    .reset_index()
    .rename_axis(columns=None, index=None)
)
print(out)

Prints:

    yyyyww    col1  event1  event2  event3  event4
0  2022-01   apple       1       1       0       1
1  2022-02     car       1       1       0       0
2  2022-03  banana       1       0       1       1
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • Ok this seems to work. Are you able to explain how? – jimiclapton Aug 15 '23 at 20:16
  • 1
    @jimiclapton It works because with pivoting you select one column as headers of the new dataframe (`event` in this case) and column where values are from (`event_flag`) - for the missing values pandas will put `NaN`, so one needs to replace them with `0`. More here: https://stackoverflow.com/questions/28337117/how-to-pivot-a-dataframe-in-pandas – Andrej Kesely Aug 15 '23 at 20:18
0
df=df1.merge(df2,on='yyyyww')
dummies=pd.get_dummies(df['event'])
df=df.join(dummies)
df=df.drop(['event','event_flag'],axis=1)
gp=df.groupby(['yyyyww','col1'])
gp.sum(numeric_only=True).reset_index()

Broadly speaking, the use of iterrows, especially for setting things, can be a bit of a footgun (not to mention slow), so I personally try to avoid it where possible.

wotb
  • 46
  • 1
  • 3