-1

I have a dataframe containing customer data. Each row pertains to a different customer. One of the dataframe's series is 2016 sales to the customer. I'd like to categorizes the customers into deciles. Each decile will have one tenth of the total sales. Decile 10 will have the top customers which account for the first 10% of total sales.

What's the most python way to accomplish this task?

Steve Maughan
  • 1,174
  • 3
  • 19
  • 30

1 Answers1

4

pandas has a pd.qcut method which, when applied to a Series, returns a categorical series (a DataFrame behaves similarly). So to get back a categorical index, you can do:

>>> import pandas as pd
# mock data
>>> df = pd.DataFrame({'sales_total': [1,2,162,126,126,12,7,1236,16,132,61,51]})
>>> cat_srs = pd.qcut(df['sales_total'], 10)
>>> print(cat_srs)
<<< 0         [1, 2.5]
    1         [1, 2.5]
    2      (159, 1236]
    3         (1, 126]
    4         (1, 126]
    5        (8, 13.2]
    6         (2.5, 8]
    7      (159, 1236]
    8        (13.2, 3]
    9     (130.8, 159]
    10         (56, 1]
    11         (3, 56]
    Name: dat, dtype: category
    Categories (10, object): [...]

You can get the series entries with the values operator, and then from there get their bin code using the codes operator:

>>> deciles = cat_srs.values.codes
>>> print(deciles)
<<< array([0, 0, 9, 6, 6, 2, 1, 9, 3, 8, 5, 4], dtype=int8)

Which is what you need. From here you could assign the deciles to the data using df['decile'] = deciles, group entries using df.groupby('decile'), and so on.

The one liner for all of the above is pd.qcut(df['sales_total'], 10).values.codes.

Edit: answering the modified question below, per the comments—I don't know a way of doing this that's baked into a library. But assuming your data is relatively continuous, you can build classes yourself like so:

target = df['sales_total'].sum() / 10
deciles = []
sum = 0
classifier = 0
for val in df['sales_total']:
    deciles.append(classifier)
    sum += val
    if sum > target:
        classifier += 1
        sum = 0
Aleksey Bilogur
  • 3,686
  • 3
  • 30
  • 57
  • Thanks for this. If I'm not mistaken, the `pd.qcut` function is based on percentiles i.e. an equal number of elements in each bin. I'm looking for the sum of sales in each bin to be approximately equal. Decile 10 will have the fewest elements (all with high sales), while Decile 1 will have the most elements (and all with low sales). – Steve Maughan Jan 28 '17 at 22:44
  • Ah, in that case you're trying to solve an NP-hard problem known as the [partition problem](https://en.wikipedia.org/wiki/Partition_problem). If you have a very small amount of data, you can get away with the solution [here](http://stackoverflow.com/questions/10035752/elegant-python-code-for-integer-partitioning). I don't know of a canonical implementation in Python, but you can probably reuse the code [here](https://people.sc.fsu.edu/~jburkardt/py_src/partition_problem/partition_problem.html). – Aleksey Bilogur Jan 28 '17 at 22:54
  • Sorry, I'm over-complicating it. If I was doing it in Excel I'd sort the accounts based on the sales. Decile 10 would be the top accounts whose sales totaled 1/10th of the total. The Decile 9 would be the next group of accounts whose sales totaled 1/10th of the total etc. – Steve Maughan Jan 28 '17 at 23:02
  • Thanks. This looks like the way to go. – Steve Maughan Jan 29 '17 at 18:01