-1

I could not think of a better way to word the question, kindly edit if you have something better.

date country total_cases
1 denmark 2
2 denmark 5
3 denmark 10
4 denmark 17
1 usa 5
2 usa 13
3 usa 23
4 usa 37
1 india 0
2 india 2
3 india 9
4 india 40

From the above dataframe, if I want to create a new feature new_cases, I can do this-

df['new_cases'] = df['total_cases'] - df['total_cases'].shift(1)

But that would also find values from where one country ends and another begins. (usa date 1 - denmark date 4). The workaround I have been using is running a loop like this-

new_cases = []
for country in df['country'].unique():
    temp_df = df[df['country'] == country]
    temp_array = temp_df['total_cases'] - temp_df['total_cases'].shift(1)
    new_cases.append(temp_array)

df['new_cases'] = pd.concat(temp_array)

As you can imagine, this takes time to run when the dataframe is huge, especially when creating multiple such features. Is there a better way to do this? Maybe a pandas function especially for this

callmeanythingyouwant
  • 1,789
  • 4
  • 15
  • 40

1 Answers1

1

Use DataFrameGroupBy.shift:

df['new_cases'] = df['total_cases'] - df.groupby('country')['total_cases'].shift(1)

Or DataFrameGroupBy.diff:

df['new_cases'] = df.groupby('country')['total_cases'].diff(1)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252