pandas add column to dataframe aggregate on time series

Question

I've done a dataframe aggregation and I want to add a new column that contains 1 if the row has a value > 0 for year 2020, and 0 otherwise.

This is my code and the head of the dataframe:

import numpy as np
import pandas as pd

df['year'] = pd.DatetimeIndex(df['TxnDate']).year # add a year column
df['client'] = df['Customer'].str.split(' ').str[:3].str.join(' ') # add a column with the first 3 words of Customer

Datedebut = df['year'].min()
Datefin = df['year'].max()
#print(df)

df1 = df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()
print(df1)
df1['nb2020']= np.where( df1['year']==2020, 1, 0)

The dataframe df1, printed before the last line, looks like this:

The last line raises: KeyError: 'year'

thanks

Is year a row instead of a column? Seems that way, I'd just like to confirm — pasha, Jun 28 '20 at 14:47
Did you try to add the new column to the original df before using "group by"? Otherwise you can use "transform" to add a column back to the original df, see here: https://stackoverflow.com/questions/37189878/pandas-add-column-to-groupby-dataframe — Jonas, Jun 28 '20 at 15:26

score 0 · Accepted Answer · answered Jun 28 '20 at 19:38

When you performed the aggregation and unstacked (df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()), the values of the year column were expanded into columns, and these columns form a MultiIndex. That is why df1['year'] raises a KeyError: year is now the name of a column level, not a column. You can inspect the columns by calling:

print (df1.columns)

And then you can select them.
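To make the column structure concrete, here is a minimal sketch on made-up data. The values are hypothetical; only the column names client, year and Amount come from the question. It should show three column levels, with year as the innermost one.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "client": ["A", "A", "B", "B", "C"],
    "year": [2019, 2020, 2019, 2020, 2019],
    "Amount": [10.0, 5.0, 7.0, 0.0, 3.0],
})

# Same aggregation as in the question.
df1 = df.groupby(["client", "year"]).agg({"Amount": ["sum"]}).unstack()

# The columns are tuples such as ('Amount', 'sum', 2020).
print(df1.columns)

# Only the innermost level is named ('year'); the outer two are unnamed.
print(df1.columns.names)
```

Client C has no 2020 rows in this sample, so its 2020 cell is NaN after unstacking.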

Using the MultiIndex column

So to select the column that matches 2020 you can use:

df1.loc[:,df1.columns.get_level_values(2).isin({2020})]

You should then be able to take that column and check whether 2020 has a value greater than zero:

df1['nb2020'] = df1.loc[:,df1.columns.get_level_values('year').isin({2020})] > 0

If you would like 1 and 0 instead of bool values, you can convert to int (using astype).
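Putting the MultiIndex steps together, a sketch on hypothetical data might look like the following. The sample values and the names mask and flag are assumptions for illustration. It also adds .any(axis=1), which is not in the answer above, to collapse the selected one-column frame into a Series before assigning it.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "client": ["A", "A", "B", "B", "C"],
    "year": [2019, 2020, 2019, 2020, 2019],
    "Amount": [10.0, 5.0, 7.0, 0.0, 3.0],
})
df1 = df.groupby(["client", "year"]).agg({"Amount": ["sum"]}).unstack()

# Boolean mask over the columns: True where the 'year' level equals 2020.
mask = df1.columns.get_level_values("year").isin([2020])

# Compare against 0, reduce to one bool per row, then convert to 1/0.
# NaN > 0 evaluates to False, so clients without 2020 data get 0.
flag = (df1.loc[:, mask] > 0).any(axis=1).astype(int)

df1["nb2020"] = flag
print(flag)
```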

Renaming the columns

If you think this is a bit complicated, you might prefer to flatten the columns to a single-level index, using something like:

df1.columns = df1.columns.get_level_values('year')

Or

df1.columns = df1.columns.get_level_values(2)

And then

df1['nb2020'] = (df1[2020] > 0).astype(int)
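As an end-to-end sketch of the renaming approach, again on hypothetical data (the sample values are made up; the column names follow the question):

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "client": ["A", "A", "B", "B", "C"],
    "year": [2019, 2020, 2019, 2020, 2019],
    "Amount": [10.0, 5.0, 7.0, 0.0, 3.0],
})
df1 = df.groupby(["client", "year"]).agg({"Amount": ["sum"]}).unstack()

# Keep only the 'year' level, so the columns become plain years (2019, 2020).
df1.columns = df1.columns.get_level_values("year")

# 1 if the 2020 total is greater than 0, otherwise 0 (NaN counts as 0 here).
df1["nb2020"] = (df1[2020] > 0).astype(int)
print(df1)
```

Note that dropping the outer levels like this only gives unique column names when a single value column is aggregated with a single function, as in the question.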
