I have an Ebola dataset with 499 records. I am trying to find the number of observations in each quintile of the prob (probability) variable, so that the observations fall into the categories 0-20%, 20-40%, etc. The code I think does this is:

test = pd.qcut(ebola.prob,5).value_counts()

This returns

[0.044, 0.094]    111
(0.122, 0.146]    104
(0.106, 0.122]    103
(0.146, 0.212]     92
(0.094, 0.106]     89

My question is: how do I sort this so that it returns the number of observations for 0-20%, 20-40%, 40-60%, 60-80% and 80-100%, in that order?

I have tried

test.value_counts(sort=False)

This returns

104    1
89     1
92     1
103    1
111    1

Is the order 104, 89, 92, 103, 111, one value per quintile?

I am confused because, going by the probability ranges in the output of my first piece of code, it looks like it should be 111, 89, 103, 104, 92.

oldtimetrad

1 Answer

What you're doing is essentially correct but you might have two issues:

  1. I think you are using pd.cut() instead of pd.qcut().
  2. You are applying value_counts() one too many times.

(1) You can reference this question here. When you use pd.qcut(), you should get the same number of records in each bin (assuming that your total number of records is evenly divisible by the number of bins), and your output does not show that. Check that you are using the function you intended to use.
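As a rough check of the equal-size claim, here is a minimal sketch on made-up data (the series names `distinct` and `tied` are hypothetical and have nothing to do with the Ebola dataset). With 499 distinct values, pd.qcut() produces bins whose sizes differ by at most one. If the values contain many ties, which is only a guess about what might be happening with real probabilities, the bins can come out uneven even with pd.qcut().

```python
import numpy as np
import pandas as pd

# 499 distinct made-up values in [0, 1), shuffled (illustrative only)
rng = np.random.default_rng(0)
distinct = pd.Series(rng.permutation(499) / 499.0)

# With distinct values, quintile bins are as equal as 499 records allow
counts_distinct = pd.qcut(distinct, 5).value_counts(sort=False)

# Rounding creates heavy ties; tied values cannot be split across bins,
# so the bin sizes are no longer (nearly) equal
tied = distinct.round(1)
counts_tied = pd.qcut(tied, 5, duplicates="drop").value_counts(sort=False)

print(counts_distinct)
print(counts_tied)
```

Comparing the two printed results shows whether ties, rather than pd.cut(), could explain uneven quintiles.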

Here is some random data to illustrate (2):

>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(1234)
>>> arr = np.random.randn(100).reshape(100,1)
>>> df = pd.DataFrame(arr, columns=['prob'])
>>> pd.cut(df.prob, 5).value_counts()

(0.00917, 1.2]       47
(-1.182, 0.00917]    34
(1.2, 2.391]          9
(-2.373, -1.182]      8
(-3.569, -2.373]      2

Passing sort=False will get you what you want: the bins are listed in interval order instead of being sorted by count.

>>> pd.cut(df.prob, 5).value_counts(sort=False)

(-3.569, -2.373]      2
(-2.373, -1.182]      8
(-1.182, 0.00917]    34
(0.00917, 1.2]       47
(1.2, 2.391]          9

or with pd.qcut()

>>> pd.qcut(df.prob, 5).value_counts(sort=False)

[-3.564, -0.64]     20
(-0.64, -0.0895]    20
(-0.0895, 0.297]    20
(0.297, 0.845]      20
(0.845, 2.391]      20
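If you want the rows labelled 0-20%, 20-40% and so on, one possible approach is to pass labels to pd.qcut() and sort the counts by index. This is only a sketch on the same random data as above; the label strings and the variable names `quintile`, `counts` and `counts_of_counts` are my own choices. The last line also shows what a second value_counts() does: it counts how often each count occurs, which is why the output in the question looked scrambled.

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
df = pd.DataFrame({"prob": np.random.randn(100)})

# Hypothetical labels, one per quintile, lowest to highest
labels = ["0-20%", "20-40%", "40-60%", "60-80%", "80-100%"]

# qcut returns an ordered categorical, so sort_index() follows label order
quintile = pd.qcut(df["prob"], 5, labels=labels)
counts = quintile.value_counts().sort_index()
print(counts)

# Calling value_counts() again tallies the counts themselves:
# the index holds bin sizes, the values say how many bins have that size
counts_of_counts = counts.value_counts()
print(counts_of_counts)
```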
o-90
  • Thanks, I was using value_counts() one too many times. I amended it to test = pd.qcut(ebola.prob,5).value_counts(sort=False) – oldtimetrad Oct 31 '15 at 16:18