2

I am testing a very basic line of code.

modDF['RatingDecile'] = pd.cut(modDF['RatingScore'], 10)

This gives me ranges of rating scores in 10 buckets. Instead of the range, how can I see 1, 2, 3, etc., up to 10?

So, instead of this.

      Score RatingQuantile  
0     (26.3, 29.0]  
6     (23.6, 26.3]  
7     (23.6, 26.3]  
8     (26.3, 29.0]  
10    (18.2, 20.9]  
       ...       ...  
9763  (23.6, 26.3]  
9769  (20.9, 23.6]  
9829  (20.9, 23.6]  
9889  (23.6, 26.3]  
9949  (20.9, 23.6] 

How can I get something like this?

      Score RatingQuantile  
0     10  
6     8 
7     8 
8     10  
10    6  
       ...      ...  
9763  8  
9769  5  
9829  5 
9889  5  
9949  5 

I tried this.

modDF['DecileRank'] = pd.qcut(modDF['RatingScore'],10,labels=False)

I got this error.

ValueError: Bin edges must be unique: array([ 2., 20., 25., 27., 27., 27., 27., 27., 27., 27., 29.]).
You can drop duplicate edges by setting the 'duplicates' kwarg

The error makes sense to me. I just don't know the work-around for this issue. Thoughts?

martineau
  • 119,623
  • 25
  • 170
  • 301
ASH
  • 20,759
  • 19
  • 87
  • 200
  • What is RatingScore? I do not seem to have a similar issue. Can you post the original data you are working with please? – Celius Stingher Jan 21 '20 at 14:12
  • Rating Scores look like this: 27 26 26 27 20 3 9 23 23 26 30 It's a bunch of number, from 0-30. I think the problem here is that these scores overlap, and that's throwing the error. – ASH Jan 21 '20 at 14:25

2 Answers2

1

I do not have issue working with qcut() if passing a series. I assume your data looks like the one I am using.

import pandas as pd
import numpy as np
data = {'values':np.random.randint(1,30,size=1000)}
df = pd.DataFrame(data)
df['ranks'] = pd.qcut(df['values'],10,labels=False)
print(df)

Output:

     values  ranks
0        18      5
1        22      7
2         5      1
3        12      3
4        14      4
..      ...    ...
995      22      7
996      13      4
997      26      8
998       3      0
999      22      7

You can check afterwards for simple operations (for example the limits of the bins) using groupby() or other bunch of functions:

df_info = df.groupby('ranks').agg(
        min_score=pd.NamedAgg(column='values',aggfunc='min'),
        max_score=pd.NamedAgg(column='values',aggfunc='max'),
        count_cases=pd.NamedAgg(column='values',aggfunc='count'))
print(df_info)

Output:

       min_score  max_score  count_cases
ranks                                   
0              1          3          137
1              4          5           72
2              6          8          105
3              9         11           96
4             12         14           98
5             15         17          107
6             18         20           91
7             21         23           99
8             24         27          121
9             28         29           74
Celius Stingher
  • 17,835
  • 6
  • 23
  • 53
1

I think what you are looking for is this:

modDF['RatingDecile'] = pd.cut(modDF['RatingScore'], 10, labels=range(1,11))
# or
modDF['RatingDecile'] = pd.cut(modDF['RatingScore'], 10, labels=False)

From the docs:

labels : array or bool, optional
Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins. This affects the type of the output container (see below). This argument is ignored when bins is an IntervalIndex.

Furthermore, if you want to "cover" the whole interval [0,30], specify the bin edges:

import numpy as np

modDF['RatingDecile'] = pd.cut(modDF['RatingScore'], 
                               bins=np.linspace(0, 30, 11), labels=False)

Caveat: Note that cut is not the same as qcut.

iipr
  • 1,190
  • 12
  • 17