
I am going through the pandas groupby docs, and when I group by particular columns as below:

df:

     A      B         C         D
0  foo    one -0.987674  0.039616
1  bar    one -0.653247 -1.022529
2  foo    two  0.404201  1.308777
3  bar  three  1.620780  0.574377
4  foo    two  1.661942  0.579888
5  bar    two  0.747878  0.463052
6  foo    one  0.070278  0.202564
7  foo  three  0.779684 -0.547192

grouped = df.groupby(['A', 'B'])
grouped.describe()

gives

              C                      ...         D                    
          count      mean       std  ...       50%       75%       max
A   B                                ...                              
bar one     1.0  0.224944       NaN  ...  1.107509  1.107509  1.107509
    three   1.0  0.704943       NaN  ...  1.833098  1.833098  1.833098
    two     1.0 -0.091613       NaN  ... -0.549254 -0.549254 -0.549254
foo one     2.0  0.282298  1.554401  ... -0.334058  0.046640  0.427338
    three   1.0  1.688601       NaN  ... -1.457338 -1.457338 -1.457338
    two     2.0  1.206690  0.917140  ... -0.096405  0.039241  0.174888

What do 25%, 50% and 75% signify in the described output? A bit of explanation, please?

KcH
  • sorry, I am not looking to expand the output; when described, what do the 25% and 50% values mean, and how are they obtained? – KcH Sep 10 '19 at 11:41
  • @jezrael It may be marked as a duplicate, but the linked question doesn't answer my question, mate – KcH Sep 10 '19 at 11:43
  • @jezrael I am not looking for display options, mate. I am asking about the values under 50% and 75% in the described dataframe above – KcH Sep 10 '19 at 11:47
  • yep, it's working. Just as min gives the minimum value, what do the 50% and 75% values mean? How do we get them? – KcH Sep 10 '19 at 11:51

5 Answers


To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously, and the max is the largest number.
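The "sort and walk a quarter of the way through" idea can be sketched in plain Python. This is a minimal illustration, not pandas' actual implementation: the helper name `percentile` and the sample values are made up for the example, and it assumes linear interpolation between neighbouring values, which is the default that pandas' `quantile` uses.

```python
def percentile(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)          # fractional index into the sorted list
    lower = int(position)                      # index at or just below that position
    upper = min(lower + 1, len(ordered) - 1)   # next index, clamped to the last element
    fraction = position - lower                # how far we are between the two neighbours
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


data = [7, 1, 5, 3, 9]  # hypothetical values; sorted: [1, 3, 5, 7, 9]
for q in (0.25, 0.5, 0.75):
    print(q, percentile(data, q))
# prints:
# 0.25 3.0
# 0.5 5.0
# 0.75 7.0
```

When the position falls between two sorted values, the result is a weighted mix of the two, which is why a percentile does not have to be a value that actually occurs in the data.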

SIBBIR AHMED

In simple words...

You will see the percentiles (25%, 50%, 75%, etc.) with a value next to each of them.

Their purpose is to tell you how your data is distributed.

For example:

s = pd.Series([1, 2, 3, 1])

s.describe() will give

count    4.000000
mean     1.750000
std      0.957427
min      1.000000
25%      1.000000
50%      1.500000
75%      2.250000
max      3.000000

25% means that at least 25% of your data has the value 1.0 or below. That is, if you were to look at your data manually, at least 25% of it is less than or equal to 1. (You will agree with this if you look at our data [1, 2, 3, 1]: [1], which is 25% of the data, is less than or equal to 1.)

50% means that 50% of your data has the value 1.5 or below: [1, 1], which make up 50% of the data, are less than or equal to 1.5.

75% means that 75% of your data has the value 2.25 or below: [1, 2, 1], which make up 75% of the data, are less than or equal to 2.25.
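As a quick check of the reading above, the sketch below asks pandas for the three quartiles of the same Series and then measures what share of the data is at or below each one. The variable names are illustrative only. Because the value 1 appears twice, the share at or below the 25% value comes out larger than 25%, which is why "at least" is the safer wording when the data contains ties.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1])

# The same quartiles that describe() reports as 25%, 50% and 75%.
quartiles = s.quantile([0.25, 0.5, 0.75])

for q, value in quartiles.items():
    share = (s <= value).mean()  # fraction of entries at or below this quartile
    print(f"{q:.0%} quantile = {value}, share of data <= it: {share:.0%}")
```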

Babatunde Mustapha

You can check the documentation of DataFrameGroupBy.describe:

Notes:

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
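To illustrate the note, here is a small sketch that passes the `percentiles` argument explicitly instead of relying on the 25/50/75 defaults, and confirms that the 50% row matches the median. The Series values are made up for the example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1])  # hypothetical data

# Ask for the 10th, 50th and 90th percentiles instead of the defaults.
custom = s.describe(percentiles=[0.1, 0.5, 0.9])
print(custom)

# The 50% row is the median of the data.
print(custom["50%"] == s.median())
```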


can you explain the foo-one value for the above example?

It is called a MultiIndex:

Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

grouped = df.groupby(['A', 'B'])
df = grouped.describe()

print(df.index)
MultiIndex([('bar',   'one'),
            ('bar', 'three'),
            ('bar',   'two'),
            ('foo',   'one'),
            ('foo', 'three'),
            ('foo',   'two')],
           names=['A', 'B'])

print(df.columns)
MultiIndex([('C', 'count'),
            ('C',  'mean'),
            ('C',   'std'),
            ('C',   'min'),
            ('C',   '25%'),
            ('C',   '50%'),
            ('C',   '75%'),
            ('C',   'max'),
            ('D', 'count'),
            ('D',  'mean'),
            ('D',   'std'),
            ('D',   'min'),
            ('D',   '25%'),
            ('D',   '50%'),
            ('D',   '75%'),
            ('D',   'max')],
           )

print(df.loc[('foo', 'one'), ('C', '75%')])
-0.19421
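To see where -0.19421 comes from, the sketch below rebuilds the frame from the table in the question and repeats the lookup, then redoes the calculation by hand. It assumes pandas' default linear interpolation: the (foo, one) group has only two C values, -0.987674 and 0.070278, so the 75% value sits three quarters of the way from the smaller to the larger. The names `described`, `from_pandas` and `by_hand` are illustrative.

```python
import pandas as pd

# Data copied from the table in the question.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": [-0.987674, -0.653247, 0.404201, 1.620780,
          1.661942, 0.747878, 0.070278, 0.779684],
    "D": [0.039616, -1.022529, 1.308777, 0.574377,
          0.579888, 0.463052, 0.202564, -0.547192],
})

described = df.groupby(["A", "B"]).describe()
from_pandas = described.loc[("foo", "one"), ("C", "75%")]

# By hand: the two C values in the (foo, one) group, sorted.
low, high = -0.987674, 0.070278
by_hand = low + 0.75 * (high - low)  # 75% of the way from low to high

print(round(from_pandas, 5), round(by_hand, 5))
```

Both numbers round to -0.19421, matching the value printed above.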
jezrael
  • can you explain the foo-one value for the above example? – KcH Sep 10 '19 at 11:57
  • @Codenewbie - hmm, maybe the easiest way to explain it: each combination of `foo` and `one` creates a new row of the final DataFrame, filled with the statistics computed by `describe` – jezrael Sep 10 '19 at 12:25
  • I meant the respective '%' values of 'C' and 'D' for foo and one – KcH Sep 10 '19 at 12:28
  • @Codenewbie - yes, because there are several functions like `count`, `mean`, `std` for each numeric column `C`, `D`, there is again a `MultiIndex` in the columns, so `df.loc[('foo','one'), ('C', '75%')]` gets that value from the `DataFrame` – jezrael Sep 10 '19 at 12:30
  • I am wondering how the 75% value comes out as -0.19421. How is it calculated? I learnt that it is a quantile and that it is a little complex – KcH Sep 10 '19 at 13:04

An old question, but I am adding an answer so that others can find help:

In my annotated version of pandas books, I explained the significance of the 25%, 50% and 75% values in the .describe() output, which answers exactly this question. Attached:

[image]

If anyone needs my annotated version, I can share it.

Grijesh Chauhan

You are seeing the quantiles of your dataframe: https://en.wikipedia.org/wiki/Quantile

For example, the 25% quantile:

25% of all your values are below that value.

In your case:

A = bar
B = one

has a 75% quantile of 1.107509, which means that 75% of your data entries for column D in the group (bar, one) are below this value.
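One detail worth keeping in mind for this particular row: the question's output shows a count of 1.0 for the (bar, one) group, so there is only a single value to summarise. The sketch below, which uses that single D value as a one-element Series (an illustrative stand-in for the group, not the original data), shows that min, every percentile and max then all collapse to that one value, and std is NaN because a standard deviation needs at least two points.

```python
import math

import pandas as pd

# A one-element Series standing in for column D of the (bar, one) group.
single = pd.Series([1.107509])
summary = single.describe()
print(summary)

# All positional statistics equal the single value; std is undefined (NaN).
print(summary["min"] == summary["25%"] == summary["75%"] == summary["max"])
print(math.isnan(summary["std"]))
```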

PV8