2

I am trying to perform a calculation within pandas aggregations. I want the calculations to be included in the aggregations. The code on what I am attempting is below. I am also using the pandas package for the df.

data = data.groupby(['type', 'name']).agg({'values': [np.min, np.max, 100 * sum([('values' > 3200)] / [np.size])]})

The formula I am trying to calculate is below:

100 * sum((values > 3200) / (np.size))

This is where np is the size of the aggregation (the numbers aggregated are limited to numbers > 3200). How to perform calculations like this within the aggregations would be of great help.

Example input data (actual dataset is much larger). The repeat values are due to the aggregation.

type, name, values
apple, blue, 2500
orange, green, 2800
peach, black, 3300
lemon, white, 3500

Desired example output (numbers are not correct due to the fact that I have yet to be able to perform the calculation):

type, name, values, np.min, np.max, calcuation
apple, blue, 2500, 1200, 40000, 2300
orange, green, 2800, 1200, 5000, 2500
Alexander
  • 105,104
  • 32
  • 201
  • 196
user2524994
  • 357
  • 1
  • 5
  • 13
  • Can you give more description on your use case? Such as the input data and desired output, I think the mention of "aggregation" might be confusing things in this case. – dave May 28 '15 at 17:50
  • a small working snippet would be useful to know what you're trying to do. – user1269942 May 28 '15 at 17:51
  • data is updated. The rest of the code is just a simple pull from a database and is put into a pandas df – user2524994 May 28 '15 at 17:54
  • Youi show example data, but not output data. More narative is needed – dave May 28 '15 at 17:55
  • should be more clear now – user2524994 May 28 '15 at 18:00
  • Sorry still confused. Is type, name, values the output that you want or is this an example of input. Without seeing both input and output I can't wrap my head around what you want. – dave May 28 '15 at 18:03
  • it is the desired input. will update with example output (numbers will not be correct though. – user2524994 May 28 '15 at 18:07
  • You start with dataframe columns [type, name, values] and want a new dataframe with [type, name, values, min, max, calculation]? I'm stull confused by what is in these columns: [min, max, calculation]. Do you want to group the frame by all unique instances of [type, name]? – dave May 28 '15 at 18:13
  • yes I would like to group by that – user2524994 May 28 '15 at 18:31
  • Isn't this more of a `pandas` question than a `numpy` one? Your title and description should reflect that. – hpaulj May 28 '15 at 18:34

2 Answers2

2

Passing df.agg a dictionary is used to specify the name of the output columns, here you're essentially writing an aggregation function which is attempting to use three formulas for one named column, and that column is already in your dataframe so its going to fail.

What you should be doing should look more like:

data = data.groupby(['type', 'name']).agg({'min':np.min, 'max':np.max, 'calculation': calculation})

Where you've rewritten your calculation function as either a lambda or a custom function, depending on how you want to do things.

  • anyway you could give me an example of the function with numpy implemented in it? – user2524994 May 28 '15 at 18:31
  • 1
    Similar questions that will give you examples to work from: http://stackoverflow.com/questions/26812763/applying-a-custom-groupby-aggregate-function-to-output-a-binary-outcome-in-panda http://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns – Nicholas Aldershof May 28 '15 at 18:42
0

You need to define the function that acts on the group to give you the percentage of values greater than 3200 and pass this, along with the other function into .agg:

func = lambda series: 100* (series > 3200).mean(); 
data.groupby(['type', 'name']).values.agg({'min': min, 'max': max, 'calculation': func})

The mean of a boolean vector gives the percentage of True values, which is a nicer way of calculating it. Also, you can pass common function names such as min and max in as strings.

JoeCondron
  • 8,546
  • 3
  • 27
  • 28