Categorize Customers into Deciles using Python Pandas?

Question

I have a dataframe containing customer data. Each row pertains to a different customer. One of the dataframe's series is 2016 sales to the customer. I'd like to categorizes the customers into deciles. Each decile will have one tenth of the total sales. Decile 10 will have the top customers which account for the first 10% of total sales.

What's the most python way to accomplish this task?

Yes - here's some [Test Data](https://www.dropbox.com/s/0ga6kgptxpzjs1i/Test-Data.csv?dl=0). It has three series - account ID, zip code and sales. The sales series is the one to use to classify the accounts — Steve Maughan, Jan 29 '17 at 17:59
I'd recommend you to provide a short sample data set (3-7 rows) and desired data set based on the input one. Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — MaxU - stand with Ukraine, Jan 29 '17 at 18:12

Aleksey Bilogur · Accepted Answer · 2017-01-31T17:34:19.870

pandas has a pd.qcut method which, when applied to a Series, returns a categorical series (a DataFrame behaves similarly). So to get back a categorical index, you can do:

>>> import pandas as pd
# mock data
>>> df = pd.DataFrame({'sales_total': [1,2,162,126,126,12,7,1236,16,132,61,51]})
>>> cat_srs = pd.qcut(df['sales_total'], 10)
>>> print(cat_srs)
<<< 0         [1, 2.5]
    1         [1, 2.5]
    2      (159, 1236]
    3         (1, 126]
    4         (1, 126]
    5        (8, 13.2]
    6         (2.5, 8]
    7      (159, 1236]
    8        (13.2, 3]
    9     (130.8, 159]
    10         (56, 1]
    11         (3, 56]
    Name: dat, dtype: category
    Categories (10, object): [...]

You can get the series entries with the values operator, and then from there get their bin code using the codes operator:

>>> deciles = cat_srs.values.codes
>>> print(deciles)
<<< array([0, 0, 9, 6, 6, 2, 1, 9, 3, 8, 5, 4], dtype=int8)

Which is what you need. From here you could assign the deciles to the data using df['decile'] = deciles, group entries using df.groupby('decile'), and so on.

The one liner for all of the above is pd.qcut(df['sales_total'], 10).values.codes.

Edit: answering the modified question below, per the comments—I don't know a way of doing this that's baked into a library. But assuming your data is relatively continuous, you can build classes yourself like so:

target = df['sales_total'].sum() / 10
deciles = []
sum = 0
classifier = 0
for val in df['sales_total']:
    deciles.append(classifier)
    sum += val
    if sum > target:
        classifier += 1
        sum = 0

Thanks for this. If I'm not mistaken, the `pd.qcut` function is based on percentiles i.e. an equal number of elements in each bin. I'm looking for the sum of sales in each bin to be approximately equal. Decile 10 will have the fewest elements (all with high sales), while Decile 1 will have the most elements (and all with low sales). — Steve Maughan, Jan 28 '17 at 22:44
Ah, in that case you're trying to solve an NP-hard problem known as the [partition problem](https://en.wikipedia.org/wiki/Partition_problem). If you have a very small amount of data, you can get away with the solution [here](http://stackoverflow.com/questions/10035752/elegant-python-code-for-integer-partitioning). I don't know of a canonical implementation in Python, but you can probably reuse the code [here](https://people.sc.fsu.edu/~jburkardt/py_src/partition_problem/partition_problem.html). — Aleksey Bilogur, Jan 28 '17 at 22:54
Sorry, I'm over-complicating it. If I was doing it in Excel I'd sort the accounts based on the sales. Decile 10 would be the top accounts whose sales totaled 1/10th of the total. The Decile 9 would be the next group of accounts whose sales totaled 1/10th of the total etc. — Steve Maughan, Jan 28 '17 at 23:02

Categorize Customers into Deciles using Python Pandas?

1 Answers1