I need to group the data by year, place, and price interval (step size 5). For each group, I want to compute the median level.

df = 
year   place   price   level
1994   AAA     90      1
1993   BBB     89      1
1994   AAA     91      2
1998   AAA     92      3
1990   BBB     80      0
1994   AAA     90      1
1990   BBB     81      0
1991   BBB     92      1

I can group the data and calculate the median of level, but I do not know how to add the price intervals:

grouped_df = df.groupby(["year","place"]).agg({'level':'median'}).reset_index()

The correct grouped_df should be structured as follows (the numbers might be different, it's just an example of the data structure):

grouped_df = 

year   place   price_min   price_max   level
1990   AAA     80          85          1
...

UPDATE:

The final result should look something like this, where price_min and price_max are the lower and upper bounds of the interval, respectively:

   year_ place_  level_median  price_min  price_max
0   1990    BBB             0         75         80
1   1991    BBB             1         80         85
2   1993    BBB             1         85         90
3   1994    AAA             1         85         90
4   1998    AAA             3         90         95
Dinosaurius
  • `df2 = df.groupby(["year","place"]).agg({'level':'median','price':[np.min, np.max]}).reset_index()`? – Chuck May 07 '17 at 19:27

2 Answers


I think you need aggregate + cut:
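
The code that builds df2 is not shown in this answer. A minimal sketch of how it might be created, assuming a min/max/median aggregation per (year, place) whose column MultiIndex is flattened with '_'.join (this is an assumption, chosen because it reproduces the column names year_, place_, level_median, price_min and price_max printed further down):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "year":  [1994, 1993, 1994, 1998, 1990, 1994, 1990, 1991],
    "place": ["AAA", "BBB", "AAA", "AAA", "BBB", "AAA", "BBB", "BBB"],
    "price": [90, 89, 91, 92, 80, 90, 81, 92],
    "level": [1, 1, 2, 3, 0, 1, 0, 1],
})

# Assumed aggregation step: median level plus raw min/max price per group
df2 = (
    df.groupby(["year", "place"])
      .agg({"level": "median", "price": ["min", "max"]})
      .reset_index()
)

# Flatten the two-level column index: ('price', 'min') -> 'price_min',
# ('year', '') -> 'year_'
df2.columns = ["_".join(col) for col in df2.columns]
```

At this point price_min and price_max still hold the raw minimum and maximum price of each group; the pd.cut calls that follow map them onto bin edges.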

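If the two bounds need to land on different bin edges, as in the expected output, add 5 to price_max wherever both bounds fall into the same bin. It is not ideal: price_max is converted to int, 5 is added under a mask, and the column is then converted back to categorical.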

bins = range(0, df['price'].max() + 5, 5)
labels_low = range(0, df['price'].max(), 5)

df2['price_min'] = pd.cut(df2['price_min'], bins=bins, labels=labels_low)
df2['price_max'] = pd.cut(df2['price_max'], bins=bins, labels=labels_low).astype(int)
mask = df2['price_min'] == df2['price_max']
df2['price_max'] = df2['price_max'].mask(mask, df2['price_max'] + 5).astype('category')

print (df2)
   year_ place_  level_median  price_min  price_max
0   1990    BBB             0         75         80
1   1991    BBB             1         90         95
2   1993    BBB             1         85         90
3   1994    AAA             1         85         90
4   1998    AAA             3         90         95

Solution without categorical:

df2['price_min'] = pd.cut(df2['price_min'], bins=bins, labels=labels_low).astype(int)
df2['price_max'] = pd.cut(df2['price_max'], bins=bins, labels=labels_low).astype(int)
mask = df2['price_min'] == df2['price_max']
df2['price_max'] = df2['price_max'].mask(mask, df2['price_max'] + 5)
print (df2)
   year_ place_  level_median  price_min  price_max
0   1990    BBB             0         75         80
1   1991    BBB             1         90         95
2   1993    BBB             1         85         90
3   1994    AAA             1         85         90
4   1998    AAA             3         90         95
jezrael
  • I probably did not explain it well, but I wrote in my question that the step size of the interval should be 5. So it should be (80-85], (85-90], (90-95]. Of course, it should not be hardcoded. – Dinosaurius May 07 '17 at 19:33
  • Thanks. price_min and price_max are lower and upper bounds of each interval. So, basically final result is very similar to this one, but just price_min and price_max should be numbers. – Dinosaurius May 07 '17 at 19:39
  • What is the difference between `min` and `price_min`? – Dinosaurius May 07 '17 at 20:14
  • In the second row both `price_min` and `price_max` are equal to 90, while `min1` and `max1` are equal to 92. I'm not sure that I understand this point. – Dinosaurius May 07 '17 at 20:16
  • Hmm, why does the second row, with price `92`, have the interval `80 85` in your question? – jezrael May 07 '17 at 20:41
  • How do you create `df2`? – Dinosaurius May 08 '17 at 12:03

If you want price grouped into ranges of 5, I would use pd.cut to create the ranges and then group by them.

df.groupby(['year',
            'place',
            pd.cut(df.price, [79, 86, 91, 96, 101],
                   labels=['80-85', '86-90', '91-95', '96-100'])]
          ).agg({'price': ['min', 'max'], 'level': 'median'})

Output:

                 price      level
                   min max median
year place price                 
1990 BBB   80-85    80  81      0
1991 BBB   91-95    92  92      1
1993 BBB   86-90    89  89      1
1994 AAA   86-90    90  91      1
1998 AAA   91-95    92  92      3

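To compute the bin edges automatically instead of hardcoding them, let's try this:

import numpy as np
import pandas as pd

# bin edges every 5 units, from just below the minimum price to just above the maximum
bins = list(range(np.min(df.price) - 5, np.max(df.price) + 5, 5))
df2 = df.groupby(['year',
                  'place',
                  pd.cut(df.price, bins)]
                ).agg({'price': ['min', 'max'], 'level': 'median'})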
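
The grouping above leaves the interval in the index. If numeric price_min and price_max columns are wanted, as the question asks, one possible sketch is below. It assumes integer prices, right-closed bins (a, b] whose edges are multiples of the step, and a pandas version that supports named aggregation (0.25 or later). The names step, lo, hi, codes and binned are illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "year":  [1994, 1993, 1994, 1998, 1990, 1994, 1990, 1991],
    "place": ["AAA", "BBB", "AAA", "AAA", "BBB", "AAA", "BBB", "BBB"],
    "price": [90, 89, 91, 92, 80, 90, 81, 92],
    "level": [1, 1, 2, 3, 0, 1, 0, 1],
})

step = 5
# -(-x // step) is integer ceiling division, so every edge is a multiple of step.
# lo sits one step below the ceiling of the minimum so that (lo, lo + step] covers it.
lo = (-(-int(df["price"].min()) // step) - 1) * step
hi = -(-int(df["price"].max()) // step) * step
bins = np.arange(lo, hi + step, step)

# labels=False makes pd.cut return the integer position of each bin
codes = pd.cut(df["price"], bins=bins, labels=False).to_numpy()

# Look up the lower and upper edge of every row's bin
binned = df.assign(price_min=bins[codes], price_max=bins[codes + 1])

grouped_df = (
    binned.groupby(["year", "place", "price_min", "price_max"], as_index=False)
          .agg(level_median=("level", "median"))
)
print(grouped_df)
```

Because the price bin is part of the group key here, a (year, place) pair whose prices span two bins produces two rows, which differs from the one-row-per-pair layout in the question.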
Scott Boston
  • Is there any way to automate it? In my real data I have a lot of intervals. – Dinosaurius May 07 '17 at 19:35
  • Min and max expected value? – Scott Boston May 07 '17 at 19:36
  • What do you mean? price_min is the lower bound of an interval, and price_max is the upper bound. So, two columns related to price. – Dinosaurius May 07 '17 at 19:36
  • @ScottBoston Great answer. I used your answer to extend mine, so I deleted mine; it wouldn't be fair otherwise! I added `.reset_index().sort_values([('price','min price')])` onto the end to get it in ascending order. Just thought I'd leave it here for the OP in case they need it. Have a pleasant day :) – Chuck May 07 '17 at 20:50