I have a dataset in which groups become treated at different times. I need a column that records the year in which each group is first treated, with a value of 0 on every untreated row (including all rows of never-treated groups).
import numpy as np
import pandas as pd
df = pd.DataFrame([['CA',2014,0],['CA',2015,0],['CA',2016,1],['CA',2017,1],
                   ['WA',2011,0],['WA',2012,1],['WA',2013,1],['TX',2010,0]],
                  columns=['Group_ID','Year','Treated'])
The dataframe should look like this once complete:
Group_ID | Year | Treated | First_Treated |
---|---|---|---|
CA | 2014 | 0 | 0 |
CA | 2015 | 0 | 0 |
CA | 2016 | 1 | 2016 |
CA | 2017 | 1 | 2016 |
WA | 2011 | 0 | 0 |
WA | 2012 | 1 | 2012 |
WA | 2013 | 1 | 2012 |
TX | 2010 | 0 | 0 |
The Python code below returns every subsequent year value rather than the first year of treatment.
df['first_treated'] = np.where(df['Treated']==1, df['Year'], 0)
I have tried the agg() and min() functions, but neither works properly.
df['first_treated'] = np.where(df['Treated']==1,df['Year'].min,0)
I have also tried the R code from "Create a group variable first.treat indicating the first year when each unit becomes treated", but starting from an empty first_treated column, the mutate() function inserts no data into the column. I receive no errors when running that R script on a similar dataframe.
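For reference, here is a minimal sketch of one way the target column might be built in pandas. It is not taken from any of the attempts above: it assumes the per-group first year can be found by masking Year to treated rows and taking a grouped minimum, and the intermediate names treated_year and first_year are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['CA', 2014, 0], ['CA', 2015, 0], ['CA', 2016, 1], ['CA', 2017, 1],
     ['WA', 2011, 0], ['WA', 2012, 1], ['WA', 2013, 1], ['TX', 2010, 0]],
    columns=['Group_ID', 'Year', 'Treated'])

# Keep Year only on treated rows (NaN elsewhere).
treated_year = df['Year'].where(df['Treated'] == 1)

# Per-group minimum of the treated years, broadcast back to every row of the group.
# Groups with no treated rows (TX here) get NaN.
first_year = treated_year.groupby(df['Group_ID']).transform('min')

# Use the group's first treated year on treated rows and 0 on all other rows.
df['First_Treated'] = np.where(df['Treated'] == 1, first_year, 0).astype(int)

print(df)
```

The grouped minimum is what separates this from the np.where attempts above: those compare each row only against its own Year (or against a minimum that is not computed per group), so they cannot return the same first year on every treated row of a group.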