38

So my dataframe looks like this:

         date    site country  score
0  2018-01-01  google      us    100
1  2018-01-01  google      ch     50
2  2018-01-02  google      us     70
3  2018-01-03  google      us     60
4  2018-01-02  google      ch     10
5  2018-01-01      fb      us     50
6  2018-01-02      fb      us     55
7  2018-01-03      fb      us    100
8  2018-01-01      fb      es    100
9  2018-01-02      fb      gb    100

Each site has a different score depending on the country. I'm trying to find the 1/3/5-day difference of scores for each site/country combination.

Output should be:

          date    site country  score  diff
8  2018-01-01      fb      es    100   0.0
9  2018-01-02      fb      gb    100   0.0
5  2018-01-01      fb      us     50   0.0
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   0.0
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   0.0
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0

I first tried sorting by site/country/date, then grouping by site and country but I'm not able to wrap my head around getting a difference from a grouped object.

Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
Craig
  • 1,929
  • 5
  • 30
  • 51

2 Answers2

74

First, sort the DataFrame and then all you need is groupby.diff():

df = df.sort_values(by=['site', 'country', 'date'])

df['diff'] = df.groupby(['site', 'country'])['score'].diff().fillna(0)

df
Out: 
         date    site country  score  diff
8  2018-01-01      fb      es    100   0.0
9  2018-01-02      fb      gb    100   0.0
5  2018-01-01      fb      us     50   0.0
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   0.0
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   0.0
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0

sort_values doesn't support arbitrary orderings. If you need to sort arbitrarily (google before fb for example) you need to store them in a collection and set your column as categorical. Then sort_values will respect the ordering you provided there.

ayhan
  • 70,170
  • 20
  • 182
  • 203
  • 6
    For whatever reason, the line above kept throwing errors like `TypeError: diff() got an unexpected keyword argument 'axis'`. However this worked: `df.groupby(['site', 'country'])['score'].transform(pd.Series.diff).fillna(0)`. – Johan Dettmar Mar 12 '19 at 18:17
  • 3
    @JohanDettmar the reason for that exception was because you were calling `diff()` on a Series, which has only a single column, not a DataFrame. The Series `diff()` doesn't have an `axis` parameter, because there's only a single axis. – Vitor Baptista Apr 08 '20 at 17:13
  • Why am I not also grouping by date? Does diff recognize the date? I don't find anything regarding the automatic detection of a date https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.diff.html – Auss Nov 12 '20 at 13:11
  • 1
    @Auss Because we are trying to find the differences between these values on different dates. If we also group by dates, each group would have a single observation. Instead, we need multiple observations in each group (for different dates) so that we can find the differences between the values for those dates. – ayhan Nov 12 '20 at 13:29
1

You can shift and substract grouped values:

df.sort_values(['site', 'country', 'date'], inplace=True)

df['diff'] = df['score'] - df.groupby(['site', 'country'])['score'].shift()

Result:

         date    site country  score  diff
8  2018-01-01      fb      es    100   NaN
9  2018-01-02      fb      gb    100   NaN
5  2018-01-01      fb      us     50   NaN
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   NaN
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   NaN
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0

To fill NaN with 0 use df['diff'].fillna(0, inplace=True).

Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73