1

I have a datafame like the following:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 2],
    'B': [1, 2, 3, 4, 5, 6],
    'C': [4, 5, 6, 7, 8, 9],
})

Now I want to group and aggregate with two values being produced per group. The result should be similar to the following:

expected = df.groupby('A').agg([min, max])

#     B       C    
#   min max min max
# A                
# 1   1   3   4   6
# 2   4   6   7   9

However, in my case, instead of two distinct functions min and max, I have one function that computes these two values at once:

def minmax(x):
    """This function promises to compute the min and max in one go."""
    return min(x), max(x)

Now my question is, how can I use this one function to produce two aggregation values per group?

It's kind of related to this answer but I couldn't figure out how to do it. The best I could come up with is using a doubly-nested apply however this is not very elegant and also it produces the multi-index on the rows rather than on the columns:

result = df.groupby('A').apply(
    lambda g: g.drop(columns='A').apply(
        lambda h: pd.Series(dict(zip(['min', 'max'], minmax(h))))
    )
)

#        B  C
# A          
# 1 min  1  4
#   max  3  6
# 2 min  4  7
#   max  6  9
a_guest
  • 34,165
  • 12
  • 64
  • 118

2 Answers2

2

If you are stuck with a function that returns a tuple of values. I'd:

  1. Define a new function that wraps the tuple values into a dict such that you predefine the dict.keys() to align with what you want the column names to be.
  2. Use a careful for loop that doesn't waste time and space.

Wrap Function

# Given Function
def minmax(x):
    """This function promises to compute the min and max in one go."""
    return min(x), max(x)

# wrapped function
def minmax_dict(x):
    return dict(zip(['min', 'max'], minmax(x)))

Careful for loop

I'm aiming to pass this dictionary into the pd.DataFrame constructor. That means, I want tuples of the MultiIndex column elements in the keys. I want the values to be dictionaries with keys being the index elements.

dat = {}
for a, d in df.set_index('A').groupby('A'):
    for cn, c in d.iteritems():
        for k, v in minmax_dict(c).items():
            dat.setdefault((cn, k), {})[a] = v

pd.DataFrame(dat).rename_axis('A')

    B       C    
  min max min max
A                
1   1   3   4   6
2   4   6   7   9

Added Detail

Take a look at the crafted dictionary

data

{('B', 'min'): {1: 1, 2: 4},
 ('B', 'max'): {1: 3, 2: 6},
 ('C', 'min'): {1: 4, 2: 7},
 ('C', 'max'): {1: 6, 2: 9}}
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • By "don't do that", do you refer to your answer or to my objective? I should also note that the solution should be more efficient than simply `df.groupby('A').agg([lambda x: minmax(x)[0], lambda x: minmax(x)[1]])`. Of course, in order to benchmark this I would need to provide a more complex dataframe as well as a more useful implementation of `minmax`. – a_guest Mar 18 '21 at 22:03
  • There might be a misunderstanding, when you suggest I shouldn't do that, as I don't have much of a choice. I used the `minmax` function for the ease of the example, but in reality I have a function which cannot simply be split into multiple others. In fact, my function fits the data of each group to a model and then returns a bunch of fit parameters together with their error estimates (so actually more than just two). The resulting data frame should then contain the parameter estimates for each of the groups. – a_guest Mar 19 '21 at 10:32
  • Ahh, that makes sense. I was getting a bit judgy under the assumption that you were doing this on purpose. If not, then there are some things you can do to wrap the function into something more convenient. Let me think about it. – piRSquared Mar 19 '21 at 16:09
  • @a_guest I updated my post with my actual recommendation. – piRSquared Mar 19 '21 at 17:13
1

One other solution:

pd.concat({k:d.agg(minmax).set_axis(['min','max'])
           for k,d in df.drop('A',axis=1).groupby(df['A'])
          })

Output:

       B  C
1 min  1  4
  max  3  6
2 min  4  7
  max  6  9
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • The solution should be equivalent to `df.groupby('A').agg([min, max])`, i.e. the multi-index should be on the columns. Also the loop over groups isn't much different from using `apply` on the groupby object, is it? – a_guest Mar 18 '21 at 21:51
  • Chain the solution with unstack and you should get the desired format. And yes, it is equivalent to apply. And last suggestion I’d say the same as @piRSqaured, avoid doing what you do with the minmax function – Quang Hoang Mar 18 '21 at 22:06
  • Using `unstack` gives me *almost* an equivalent solution, the only difference being the index which is then unnamed. The example used in my question might be a little misleading in that it is composed of two distinct functions. In my use case, I have a function `myfunc` which cannot simply be split into two functions. The only way to do that would be `df.groupby('A').agg([lambda x: myfunc(x)[0], lambda x: myfunc(x)[1]])` but then `myfunc` is not cheap to compute and so this would waste a lot of compute power. – a_guest Mar 18 '21 at 22:42
  • If you want the name, then chain that with `rename_axis('A')` after unstack. – Quang Hoang Mar 18 '21 at 23:08
  • Well, that's exactly why I came here asking this question, because stacking and nesting so many function calls is not really readable anymore. So I was hoping for a cleaner solution. Anyway thanks for your answer. By the way, you should update it to match the expected result from the question. – a_guest Mar 18 '21 at 23:10