What are the 25%, 50%, 75% values when we describe a grouped DataFrame?

Question

I am going through the pandas groupby docs, and when I group by particular columns as below:

df:

     A      B         C         D
0  foo    one -0.987674  0.039616
1  bar    one -0.653247 -1.022529
2  foo    two  0.404201  1.308777
3  bar  three  1.620780  0.574377
4  foo    two  1.661942  0.579888
5  bar    two  0.747878  0.463052
6  foo    one  0.070278  0.202564
7  foo  three  0.779684 -0.547192

grouped = df.groupby(['A', 'B'])
grouped.describe()

gives

              C                      ...         D                    
          count      mean       std  ...       50%       75%       max
A   B                                ...                              
bar one     1.0  0.224944       NaN  ...  1.107509  1.107509  1.107509
    three   1.0  0.704943       NaN  ...  1.833098  1.833098  1.833098
    two     1.0 -0.091613       NaN  ... -0.549254 -0.549254 -0.549254
foo one     2.0  0.282298  1.554401  ... -0.334058  0.046640  0.427338
    three   1.0  1.688601       NaN  ... -1.457338 -1.457338 -1.457338
    two     2.0  1.206690  0.917140  ... -0.096405  0.039241  0.174888

What do 25%, 50% and 75% signify in the described output? A bit of explanation, please.

Sorry, I am not looking to expand the output. When described, what do those 25% and 50% values mean, and how are they obtained? — KcH, Sep 10 '19 at 11:41
@jezrael It may be a duplicate question, but the linked one doesn't answer my question, mate — KcH, Sep 10 '19 at 11:43
@jezrael I am not looking for display options, mate. I am asking about the values under 50% and 75% in the described dataframe above — KcH, Sep 10 '19 at 11:47
Yep, it's working. Just as min gives the minimum value, what do the 50% and 75% values mean? How do we get them? — KcH, Sep 10 '19 at 11:51

score 10 · Answer 1 · answered Apr 28 '20 at 12:18

To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously, and the max is the largest number.
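The "sort and walk through the list" idea can be sketched in plain Python. This is only an illustration, assuming linear interpolation between neighbouring sorted values (the default interpolation in pandas); the helper name `percentile_linear` is made up for this example and is not a pandas function. The numbers are column C from the question's dataframe.

```python
def percentile_linear(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(values)                  # lowest to highest
    pos = q * (len(ordered) - 1)              # fractional position in the sorted list
    lower = int(pos)                          # index of the value just below that position
    upper = min(lower + 1, len(ordered) - 1)  # index of the value just above it
    frac = pos - lower                        # how far we are between the two neighbours
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


# Column C from the question's dataframe
column_c = [-0.987674, -0.653247, 0.404201, 1.620780,
            1.661942, 0.747878, 0.070278, 0.779684]

# min, 25%, 50%, 75% and max are just quantiles 0, 0.25, 0.5, 0.75 and 1
quartiles = {q: percentile_linear(column_c, q) for q in (0.0, 0.25, 0.5, 0.75, 1.0)}
for q, value in quartiles.items():
    print(f"{q:.0%}: {value:.6f}")
```

With q = 0 the function returns the smallest value and with q = 1 the largest, which is why min and max sit at the two ends of the same list of statistics.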

Babatunde Mustapha · Answer 2 · 2021-05-13T17:01:32.243

In simple words...

You will see the percentiles (25%, 50%, 75%, etc.) with a value next to each of them.

Their purpose is to tell you how your data is distributed.

For example:

s = pd.Series([1, 2, 3, 1])

s.describe() will give

count    4.000000
mean     1.750000
std      0.957427
min      1.000000
25%      1.000000
50%      1.500000
75%      2.250000
max      3.000000

25% means that at least 25% of your data have the value 1.0 or below. That is, if you were to look at your data manually, at least 25% of it is less than or equal to 1. You will agree with this if you look at our data [1, 2, 3, 1]: [1], which is 25% of the data, is less than or equal to 1.

50% means 50% of your data have the value 1.5 or below. [1, 1], which constitute 50% of the data, are less than or equal to 1.5.

75% means 75% of your data have the value 2.25 or below. [1, 2, 1], which constitute 75% of the data, are less than or equal to 2.25.
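If you want to check these statements yourself, a small sketch like the following counts the share of values at or below each cutoff. It assumes pandas is installed; the variable names are only for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1])
summary = s.describe()

for label in ("25%", "50%", "75%"):
    cutoff = summary[label]
    # Fraction of the data that is less than or equal to the cutoff
    share = (s <= cutoff).mean()
    print(f"{label}: cutoff={cutoff}, share at or below={share:.2f}")
```

Because the value 1 appears twice, the share at or below the 25% cutoff is actually one half, which is why "at least 25%" is the safer way to read it for small samples with ties.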

score 3 · Accepted Answer · edited Jun 20 '20 at 09:12

You can check the documentation of DataFrameGroupBy.describe:

Notes:

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
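To see the "lower" and "upper" percentiles from that note in action, here is a small sketch that passes a custom `percentiles` list to `describe` on a grouped column. It assumes pandas is installed and reuses columns A and C from the question; the choice of 10% and 90% is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [-0.987674, -0.653247, 0.404201, 1.620780,
          1.661942, 0.747878, 0.070278, 0.779684],
})

# Default: lower percentile is 25%, upper percentile is 75%
default_stats = df.groupby("A")["C"].describe()

# Custom: ask for the 10th, 50th and 90th percentiles instead
custom_stats = df.groupby("A")["C"].describe(percentiles=[0.1, 0.5, 0.9])

print(list(default_stats.columns))
print(list(custom_stats.columns))

# The 50% column is the median of each group
print(custom_stats["50%"])
print(df.groupby("A")["C"].median())
```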

Can you explain the foo-one value in the above example?

It is called a MultiIndex:

Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

grouped=df.groupby(['A', 'B'])
df = grouped.describe()

print (df.index)
MultiIndex([('bar',   'one'),
            ('bar', 'three'),
            ('bar',   'two'),
            ('foo',   'one'),
            ('foo', 'three'),
            ('foo',   'two')],
           names=['A', 'B'])

print (df.columns)
MultiIndex([('C', 'count'),
            ('C',  'mean'),
            ('C',   'std'),
            ('C',   'min'),
            ('C',   '25%'),
            ('C',   '50%'),
            ('C',   '75%'),
            ('C',   'max'),
            ('D', 'count'),
            ('D',  'mean'),
            ('D',   'std'),
            ('D',   'min'),
            ('D',   '25%'),
            ('D',   '50%'),
            ('D',   '75%'),
            ('D',   'max')],
           )

print (df.loc[('foo','one'), ('C', '75%')])
-0.19421

@Codenewbie - hmmm, maybe the easiest explanation is that each combination of `foo` and `one` creates a new row of the final dataframe, filled with the statistics computed by `describe` — jezrael, Sep 10 '19 at 12:25
@Codenewbie - yes, because there are several functions like `count`, `mean`, `std` for each numeric column `C`, `D`, there is again a `MultiIndex` in the columns, so `df.loc[('foo','one'), ('C', '75%')]` selects a single value from the `DataFrame` — jezrael, Sep 10 '19 at 12:30
I am wondering how the 75% value comes out as -0.19421. How is it calculated? I learnt it is a quantile, and it seems a little complex — KcH, Sep 10 '19 at 13:04
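The -0.19421 figure can be reproduced by hand. In the question's dataframe, the group (foo, one) has two values in column C: -0.987674 and 0.070278. Assuming the default linear interpolation, the 75% value sits three quarters of the way from the smaller value to the larger one. A minimal sketch of that arithmetic (the variable names are only for illustration):

```python
# The two C values of group ('foo', 'one'), sorted from low to high
low, high = -0.987674, 0.070278
q = 0.75

# With two values, the q-th quantile is q of the way from low to high
value = low + q * (high - low)
print(round(value, 5))  # -0.19421
```

With more than two values in a group the same idea applies, only the interpolation happens between the two sorted neighbours closest to the requested position.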

Grijesh Chauhan · Answer 4 · 2021-08-27T01:43:01.660

score 3

(Old question, but adding an answer so that others can find help.)

In my annotated version of a pandas book, I explained the significance of the 25%, 50% and 75% values in .describe() output, which answers exactly this question (attached as an image).

If anyone needs my annotated version, I can share it.

edited Aug 27 '21 at 01:43

answered Aug 23 '21 at 07:55


PV8 · Answer 5 · 2019-09-10T12:09:51.063

score 2

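You are seeing the quantiles of your dataframe: https://en.wikipedia.org/wiki/Quantile

For example, the 25% quantile:

25% of all your values are below that value.

In your case:

A = bar
B = one

has a 75% quantile of 1.107509 for column D, which means that 75% of the D entries in the group (bar, one) are at or below this value. Note that this group appears to contain a single row (its count is 1.0 in the output above), so every percentile, as well as min and max, equals that one value.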
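A quick way to look at one group's quantiles is to describe a single column and select the group by its (A, B) key. The sketch below assumes pandas is installed and uses the D values from the question's dataframe listing, which differ from the numbers in its printed output, so the result will not be 1.107509.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "D": [0.039616, -1.022529, 1.308777, 0.574377,
          0.579888, 0.463052, 0.202564, -0.547192],
})

# One row of statistics per (A, B) combination
stats = df.groupby(["A", "B"])["D"].describe()

# The group ('bar', 'one') has a single row, so all quantiles coincide
single = stats.loc[("bar", "one")]
print(single[["count", "min", "25%", "50%", "75%", "max"]])
```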

edited Sep 10 '19 at 12:09

answered Sep 10 '19 at 11:58


Makes sense, but I could not figure it out, a little complex!! – KcH Sep 10 '19 at 12:12
