Pandas groupby multiple fields then diff

Question

So my dataframe looks like this:

         date    site country  score
0  2018-01-01  google      us    100
1  2018-01-01  google      ch     50
2  2018-01-02  google      us     70
3  2018-01-03  google      us     60
4  2018-01-02  google      ch     10
5  2018-01-01      fb      us     50
6  2018-01-02      fb      us     55
7  2018-01-03      fb      us    100
8  2018-01-01      fb      es    100
9  2018-01-02      fb      gb    100

Each site has a different score depending on the country. I'm trying to find the 1/3/5-day difference of scores for each site/country combination.

Output should be:

          date    site country  score  diff
8  2018-01-01      fb      es    100   0.0
9  2018-01-02      fb      gb    100   0.0
5  2018-01-01      fb      us     50   0.0
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   0.0
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   0.0
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0

I first tried sorting by site/country/date, then grouping by site and country but I'm not able to wrap my head around getting a difference from a grouped object.

how to i get StringIO for python3? I am trying to recreate your problem — Julian Rachman, Jan 19 '18 at 18:40
@Alex @ayhan I've edited the expected output. Essentially, `es` and `gb` would appear before `us`. — Craig, Jan 19 '18 at 19:06
@Craig for that you could add `df.sort_values(by=['site', 'country', 'date'], ascending=[False, True, True])` with @ayhan 's answer — akhilsp, Jan 19 '18 at 19:14

score 74 · Accepted Answer · answered Jan 19 '18 at 19:11

74

First, sort the DataFrame and then all you need is groupby.diff():

df = df.sort_values(by=['site', 'country', 'date'])

df['diff'] = df.groupby(['site', 'country'])['score'].diff().fillna(0)

df
Out: 
         date    site country  score  diff
8  2018-01-01      fb      es    100   0.0
9  2018-01-02      fb      gb    100   0.0
5  2018-01-01      fb      us     50   0.0
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   0.0
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   0.0
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0

sort_values doesn't support arbitrary orderings. If you need to sort arbitrarily (google before fb for example) you need to store them in a collection and set your column as categorical. Then sort_values will respect the ordering you provided there.

answered Jan 19 '18 at 19:11

ayhan

70,170
20
182
203

6

For whatever reason, the line above kept throwing errors like `TypeError: diff() got an unexpected keyword argument 'axis'`. However this worked: `df.groupby(['site', 'country'])['score'].transform(pd.Series.diff).fillna(0)`. – Johan Dettmar Mar 12 '19 at 18:17
3

@JohanDettmar the reason for that exception was because you were calling `diff()` on a Series, which has only a single column, not a DataFrame. The Series `diff()` doesn't have an `axis` parameter, because there's only a single axis. – Vitor Baptista Apr 08 '20 at 17:13
Why am I not also grouping by date? Does diff recognize the date? I don't find anything regarding the automatic detection of a date https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.diff.html – Auss Nov 12 '20 at 13:11
1

@Auss Because we are trying to find the differences between these values on different dates. If we also group by dates, each group would have a single observation. Instead, we need multiple observations in each group (for different dates) so that we can find the differences between the values for those dates. – ayhan Nov 12 '20 at 13:29

score 1 · Answer 2 · answered Sep 10 '21 at 09:55

You can shift and substract grouped values:

df.sort_values(['site', 'country', 'date'], inplace=True)

df['diff'] = df['score'] - df.groupby(['site', 'country'])['score'].shift()

Result:

         date    site country  score  diff
8  2018-01-01      fb      es    100   NaN
9  2018-01-02      fb      gb    100   NaN
5  2018-01-01      fb      us     50   NaN
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   NaN
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   NaN
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0

To fill NaN with 0 use df['diff'].fillna(0, inplace=True).

Pandas groupby multiple fields then diff

2 Answers2

Linked

Related