0

I have a data frame with sales by month and the price at each month's observation. Is there a way to take those prices, and break them up into 10 equal groups based on the price itself, so group 1 would be 10% with the lowest prices, and group 10 be 10% of the observations with the highest prices?

I tried pd.cut but I need to know where those 10% cut points are to bin correctly and I'm not sure how to get that when I have over 30K rows of data. I also need equal number of observations in each group.

Here is some test data:

from scipy.stats import poisson
import pandas as pd

data = poisson.rvs(mu=42.73, size=10000)
data_ = pd.DataFrame(data, columns=['price'])

Thank you.

Jordan
  • 1,415
  • 3
  • 18
  • 44
  • [`pd.qcut(data_['price'], 10)`](https://pandas.pydata.org/docs/reference/api/pandas.qcut.html). – Quang Hoang Aug 29 '22 at 13:40
  • My apologies as I forgot to add I needed the same amount of observations in each group. Then add a column denoting the price group each observation is in. – Jordan Aug 29 '22 at 13:44
  • `data_.groupby(pd.qcut(data_['price'], 10)).value_counts()`? – Quang Hoang Aug 29 '22 at 13:46
  • with only a relatively small number of discrete values (in comparison to the total number of observations) it'll be impossible to get equally sized groups – Stef Aug 29 '22 at 14:02

1 Answers1

1

First let's create a dataframe to use as an example for OP's use case

import pandas as pd

sales = pd.DataFrame({'sales': [100, 200, 300, 400, 500, 600],
                        'price': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

Depending on one's goals one can use:

pandas.cut

sales['group'] = pd.cut(sales['price'], 10, labels=False)

[Out]:
   sales  price  group
0    100    1.0      0
1    200    2.0      1
2    300    3.0      3
3    400    4.0      5
4    500    5.0      7

pandas.qcut

sales['group'] = pd.qcut(sales['price'], 10, labels=False)

[Out]:
   sales  price  group
0    100    1.0      0
1    200    2.0      1
2    300    3.0      3
3    400    4.0      5
4    500    5.0      7

In this example the output is the same, but that is not necessarily the case. Therefore, for more information on the difference between pandas.cut and pandas.qcut, see this thread: What is the difference between pandas.qcut and pandas.cut?

Additionally, if one wants to see the labels for the resulting bins, change labels=False to labels=True.

Finally, in order to get the amount of observations in each group, one can use pandas.DataFrame.groupby. Let's create a new dataframe called group_counts

group_counts = sales.groupby('group').size().reset_index(name='counts')

[Out]:
   group  counts
0      0       1
1      1       1
2      3       1
3      5       1
4      7       1
Gonçalo Peres
  • 11,752
  • 3
  • 54
  • 83
  • Thank you. This is good, but how do I ensure I get equal counts of observations into each group? I have some outliers and they skew the cut. I want 10 groups of equal (or near equal in the case of odd number observations) counts. – Jordan Aug 29 '22 at 13:55
  • 1
    @Jordan for that instead `pd.cut` one needs to use `pd.qcut`. – Gonçalo Peres Aug 29 '22 at 13:56