
How can I aggregate data when the order of grouped data is important? (bonus points if this can be done in an elegant vectorized way). If that was clear as mud, let me explain with an example.


Let's say I have data in df:

id    month              value
------------------------------
001   2019-01-01 (Jan)     111
001   2019-02-01 (Feb)     222
001   2019-03-01 (Mar)     333

002   2019-01-01 (Jan)       0
002   2019-02-01 (Feb)       0
002   2019-03-01 (Mar)      25

...   ...                  ...

999   2019-01-01 (Jan)     800
999   2019-02-01 (Feb)     600
999   2019-03-01 (Mar)     400

I can use groupby to aggregate the data over each id:

import numpy

df.groupby('id')['value'].agg([numpy.sum, numpy.mean])

Whether I use numpy.sum, numpy.mean, numpy.max, etc. as the aggregating function, the order of the isolated array that is grouped doesn't matter (e.g., [111, 222, 333] for id=001) - the result will always be the same.


However, there are some aggregations where the order does matter - for example, I may want to calculate:

  • a weighted average (e.g., if more recent values carry more weight; see the sketch after this list)
  • a start-to-finish change (e.g., Mar - Jan)
  • etc.
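For the weighted-average bullet above, a per-group function might look like this minimal sketch; the linear 1..n recency weights and the name recency_weighted_mean are illustrative assumptions, not from the original post:

import numpy as np

def recency_weighted_mean(group):
    # Sort by month so that "recent" really means later in time,
    # then weight linearly: 1 for the oldest row, n for the newest
    # (the linear scheme is only an assumption for illustration).
    g = group.sort_values('month')
    weights = np.arange(1, len(g) + 1)
    return np.average(g['value'], weights=weights)

df.groupby('id').apply(recency_weighted_mean)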

Currently, I loop through each id and then (sketched in code after this list):

  1. filter the data through df[df['id']==id]
  2. get a list of month-value tuples, e.g. [(Jan,111), (Feb,222), (Mar,333)]
  3. sort list based on the first element of each tuple, i.e., 'month'
  4. perform aggregation
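Spelled out as code, that loop looks roughly like this (a sketch of the four steps above, using the start-to-finish change as the aggregation):

results = {}
for i in df['id'].unique():
    sub = df[df['id'] == i]                          # 1. filter by id
    pairs = sorted(zip(sub['month'], sub['value']))  # 2./3. (month, value) tuples, sorted by month
    results[i] = pairs[-1][1] - pairs[0][1]          # 4. aggregate: last minus first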

For example, if I just wanted to find the difference between the first and last elements of that sorted array, then I'd end up with this:

id    finish_minus_start
------------------------
001                  222
002                   25
...                  ...
999                 -400

To restate: how can I aggregate data when the order of the grouped data is important, and can I do this more efficiently by making use of vectorization instead of looping through each id?
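For what it's worth, here is a minimal vectorized sketch (an editorial suggestion, not from the original post): pandas' groupby preserves the row order within each group, so sorting the whole frame by 'month' once up front makes order-sensitive aggregations safe. The name finish_minus_start is taken from the example table above.

out = (df.sort_values('month')
         .groupby('id')['value']
         .agg(lambda s: s.iloc[-1] - s.iloc[0])
         .rename('finish_minus_start'))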

  • Maybe this solution helps: https://stackoverflow.com/questions/25995771/pandas-sorting-by-group-aggregate It computes a previous order to rearrange the grouping – BCJuan Apr 02 '19 at 09:45
  • @BCJuan - thanks - this looks potentially useful. That solution will order `df` such that for each `id`, the rows will be sorted by `'month'`. Good start. But can you confirm that when aggregating on this sorted `df`, that the order of those rows are reflected in the values that are passed into the aggregating function (e.g., `numpy.mean(x)`)? i.e., that the isolated values aren't jiggled about under-the-hood for some reason that I can't comprehend. – Ben Apr 02 '19 at 09:58
  • what about something like `df.groupby('id').apply(lambda x: x.sort_values('month'))`? – BCJuan Apr 02 '19 at 10:19
  • I guess vectorized functions are only available for the most common operations; for the rest of them, we have to use: `df.groupby('id').apply(lambda x: User_Defined_function(x))` – hacker315 Apr 02 '19 at 10:34
  • @hacker315 - can you be a little more explicit, since I'm struggling to see the solution! Say I wanted my user-defined function to be `difference_between_first_and_last_elements()`, how would I define that function to return a single value for each `id`? i.e., `sorted_list[-1] - sorted_list[0]`. – Ben Apr 02 '19 at 14:50

1 Answer


I hope this is what you were looking for. You can use agg and supply your own function. I took examples from two links, Summarizing and Grouping data and First and Last Functions, and played around a little.

df = df.set_index('month')  # the question's date column is 'month'
aggregations = {
    # x is the 'value' Series for one id, indexed by month:
    # value at the latest month minus value at the earliest month
    'value': lambda x: x.loc[x.index.max()] - x.loc[x.index.min()]
}
print(df.groupby('id').agg(aggregations))
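As a quick sanity check, running the snippet above on a trimmed version of the sample data (two ids, three months each) reproduces the expected start-to-finish differences:

import pandas as pd

df = pd.DataFrame({
    'id':    ['001', '001', '001', '002', '002', '002'],
    'month': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01'] * 2),
    'value': [111, 222, 333, 0, 0, 25],
})

df = df.set_index('month')
aggregations = {'value': lambda x: x.loc[x.index.max()] - x.loc[x.index.min()]}
print(df.groupby('id').agg(aggregations))
#      value
# id
# 001    222
# 002     25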
  • At the moment, I think this just finds the difference between the first and last elements in the `value` column (good) but I'm not guaranteed that those first and last elements are defined by what's in the corresponding `month` column (bad) - it could just be randomly ordered for all I know. Can you amend your answer so that the result would yield the dataframe I displayed at the bottom of my original post? – Ben Apr 03 '19 at 08:29
  • Edited. If I understand correctly, this should be it. You will need to assign a date column for this and then index it. – Muzamir Apr 03 '19 at 09:06
  • Many thanks - this works :) although I'm a little nervy using it since I don't properly understand what's going on (although that's down to my limited knowledge of lambda functions). How does the lambda function know what to do with the `value` column - I don't see it referenced anywhere within the lambda function? – Ben Apr 03 '19 at 13:06
  • Just see lambda as a single-line function. You can try creating a separate function to help you understand, e.g. `def func_a(x): return x.loc[x.index.max()] - x.loc[x.index.min()]`. From my understanding of how it works, the grouped data is passed into `func_a`. The `aggregations` dict has the key `'value'`, so the function receives the `'value'` column and its result is returned to `agg`. – Muzamir Apr 03 '19 at 15:37
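To make that last comment concrete, here is the answer's lambda rewritten as a named function; the name finish_minus_start is made up for illustration:

def finish_minus_start(x):
    # x is the 'value' Series for a single id, indexed by month,
    # because agg maps the dict key 'value' onto that column.
    return x.loc[x.index.max()] - x.loc[x.index.min()]

df.groupby('id').agg({'value': finish_minus_start})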