I could not think of a better way to word the question, kindly edit if you have something better.
date | country | total_cases |
---|---|---|
1 | denmark | 2 |
2 | denmark | 5 |
3 | denmark | 10 |
4 | denmark | 17 |
1 | usa | 5 |
2 | usa | 13 |
3 | usa | 23 |
4 | usa | 37 |
1 | india | 0 |
2 | india | 2 |
3 | india | 9 |
4 | india | 40 |
From the above dataframe, if I want to create a new feature new_cases
, I can do this-
df['new_cases'] = df['total_cases'] - df['total_cases'].shift(1)
But that would also find values from where one country ends and another begins. (usa date 1 - denmark date 4). The workaround I have been using is running a loop like this-
new_cases = []
for country in df['country'].unique():
temp_df = df[df['country'] == country]
temp_array = temp_df['total_cases'] - temp_df['total_cases'].shift(1)
new_cases.append(temp_array)
df['new_cases'] = pd.concat(temp_array)
As you can imagine, this takes time to run when the dataframe is huge, especially when creating multiple such features. Is there a better way to do this? Maybe a pandas function especially for this