How can I aggregate data when the order of grouped data is important? (bonus points if this can be done in an elegant vectorized way). If that was clear as mud, let me explain with an example.
Let's say I have data in `df`:
id month value
------------------------------
001 2019-01-01 (Jan) 111
001 2019-02-01 (Feb) 222
001 2019-03-01 (Mar) 333
002 2019-01-01 (Jan) 0
002 2019-02-01 (Feb) 0
002 2019-03-01 (Mar) 25
... ... ...
999 2019-01-01 (Jan) 800
999 2019-02-01 (Feb) 600
999 2019-03-01 (Mar) 400
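A minimal runnable version of this frame (just the three `id`s shown, with the months as actual dates) might be:

```python
import pandas as pd

df = pd.DataFrame({
    'id':    ['001'] * 3 + ['002'] * 3 + ['999'] * 3,
    'month': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01'] * 3),
    'value': [111, 222, 333, 0, 0, 25, 800, 600, 400],
})
```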
I can use `groupby` to aggregate the data over each `id`:
import numpy

df.groupby('id')['value'].agg([numpy.sum, numpy.mean])
Whether I use `numpy.sum`, `numpy.mean`, `numpy.max`, etc. as the aggregating function, the order of the values within each group (e.g., `[111, 222, 333]` for `id=001`) doesn't matter: the result will always be the same.
However, there are some aggregations where the order does matter. For example, I may want to calculate:

- a weighted average (e.g., if more recent values have more weight; see the sketch after this list)
- a start-to-finish change (e.g., `Mar` minus `Jan`)
- etc.
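For instance, a recency-weighted average per `id` might look like the sketch below. The linearly increasing weights are only an illustrative assumption; the point is that the result is wrong unless the values are in chronological order first:

```python
import numpy

# Sort chronologically within each id; groupby preserves this row order
# within each group.
df_sorted = df.sort_values(['id', 'month'])

def recency_weighted_mean(s):
    # Weights 1..n, so later (more recent) months count more.
    weights = numpy.arange(1, len(s) + 1)
    return numpy.average(s, weights=weights)

df_sorted.groupby('id')['value'].agg(recency_weighted_mean)
```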
Currently, I loop through each `id` and then:

- filter the data via `df[df['id'] == id]`
- get a list of month-value tuples, e.g. `[(Jan, 111), (Feb, 222), (Mar, 333)]`
- sort the list based on the first element of each tuple, i.e., `'month'`
- perform the aggregation (sketched in code below)
For example, if I just wanted to find the difference between the first and last elements of that sorted array, then I'd end up with this:
id finish_minus_start
------------------------
001 222
002 25
... ...
999 -400
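In code, my current approach is roughly this sketch (slow, since it re-filters the whole frame once per `id`):

```python
results = {}
for i in df['id'].unique():
    sub = df[df['id'] == i]                        # filter rows for this id
    pairs = list(zip(sub['month'], sub['value']))  # (month, value) tuples
    pairs.sort(key=lambda t: t[0])                 # sort by month
    results[i] = pairs[-1][1] - pairs[0][1]        # finish minus start
```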
How can I aggregate data when the order of grouped data is important? Can I do this more efficiently by making use of vectorization instead of looping through each `id`?
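For this particular aggregation, I suspect something like sorting once and then taking per-group last minus first would vectorize it (a sketch, untested against edge cases like missing months):

```python
ordered = df.sort_values('month')
grouped = ordered.groupby('id')['value']
finish_minus_start = grouped.last() - grouped.first()
```

But I don't see how to generalize that to arbitrary order-dependent aggregations like the weighted average above.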