4

Could this be a bug? When I call describe() or std() on a groupby object, I get different answers.

import pandas as pd
import numpy as np
import random as rnd

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : 1*(np.random.randn(8)>0.5),
                   'D' : np.random.randn(8)})
df.head()

df[['C','D']].groupby(['C'],as_index=False).describe()
# This line reports the standard deviation of 'C' as 0, 0. Within each group the value of C is constant, so that makes sense.

df[['C','D']].groupby(['C'],as_index=False).std()
# This line reports the standard deviation of 'C' as 0, 1. I think this is wrong.
OzgunBu

3 Answers

1

It makes sense. In the second case, you only compute the std of column D.

How? That's just how the groupby works. You

  1. slice on C and D
  2. groupby on C
  3. call GroupBy.std

At step 3, you did not specify any column, so std is computed on every column that is not a grouper, which here is just column D.

As for why you see C with 0, 1: that's because you specify as_index=False, so the C column is inserted holding the group keys from the original DataFrame, which in this case are 0 and 1. Those values are group labels, not standard deviations.

Run this and it'll become clear.

df[['C','D']].groupby(['C']).std()

          D
C          
0  0.998201
1       NaN

When you specify as_index=False, the index you see above is inserted as a column. Contrast this with,

df[['C','D']].groupby(['C'])[['C', 'D']].std()

     C         D
C               
0  0.0  0.998201
1  NaN       NaN

Which is exactly what describe gives, and what you're looking for.
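
To reproduce the effect without depending on random data, here is a minimal sketch on a small made-up frame (the values in C and D are my own assumption, not the asker's data). It only illustrates that as_index=False moves the group keys from the index into a regular column while the aggregated D values stay the same.

```python
import pandas as pd

# Small fixed frame (made-up values) so the result does not depend on a random seed.
df = pd.DataFrame({
    "C": [0, 0, 0, 1, 1],
    "D": [1.0, 2.0, 3.0, 4.0, 6.0],
})

# Group keys end up in the index; only D is aggregated.
indexed = df.groupby("C").std()

# Group keys end up in a regular column named C; the D values are the same.
flat = df.groupby("C", as_index=False).std()

print(indexed)
print(flat)
```

In the second result, the C column holds the labels 0 and 1, which is easy to misread as a standard deviation.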

cs95
  • Thanks COLDSPEED, I see the problem. Though I still find it visually confusing. Thanks for teaching the trick with index being inserted as the column. – OzgunBu Mar 22 '18 at 19:12
1

My friend mukherjees and I ran more trials on this and concluded that there really is an issue with std(). In the following link we show that "std() is not the same as .apply(np.std, ddof=1)". After noticing this, we also found the following related bug report:

https://github.com/pandas-dev/pandas/issues/10355

OzgunBu
-1

Even with std(), you get a zero standard deviation of C within each group. I just added a seed to your code to make it reproducible. I am not sure what the issue is:

import pandas as pd
import numpy as np
import random as rnd

np.random.seed(1987)  # call the function; assigning to np.random.seed does not seed anything
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
     'foo', 'bar', 'foo', 'foo'],
     'B' : ['one', 'one', 'two', 'three',
     'two', 'two', 'one', 'three'],
     'C' : 1*(np.random.randn(8)>0.5),
     'D' : np.random.randn(8)})
df

df[['C','D']].groupby(['C'],as_index=False).describe()

[image: output of the describe() call]

df[['C','D']].groupby(['C'],as_index=False).std()

[image: output of the std() call]

To dig deeper, look at the source code of describe for groupby, which inherits from DataFrame.describe:

def describe_numeric_1d(series):
    stat_index = (['count', 'mean', 'std', 'min'] +
                  formatted_percentiles + ['max'])
    d = ([series.count(), series.mean(), series.std(), series.min()] +
         [series.quantile(x) for x in percentiles] + [series.max()])
    return pd.Series(d, index=stat_index, name=series.name)

The code above shows that describe simply reports the result of std().
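
A quick way to check this claim is to compare the two per group on an explicitly selected column. This is only a sketch on made-up data (the frame below is an assumption for illustration); selecting D explicitly sidesteps any question about how the grouping column is handled.

```python
import numpy as np
import pandas as pd

# Made-up data; every group has at least two rows, so no std is NaN.
df = pd.DataFrame({
    "C": [0, 0, 0, 1, 1],
    "D": [1.0, 2.0, 3.0, 4.0, 6.0],
})

grouped = df.groupby("C")["D"]

# The "std" column of describe() versus a direct std() call on the same groups.
from_describe = grouped.describe()["std"]
from_std = grouped.std()

same = np.allclose(from_describe, from_std)
print(same)
```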

Aritesh
  • I don't really see an answer to the question. – cs95 Mar 22 '18 at 05:18
  • The second row under the C column is exactly what confuses me (0, 1 rather than 0, 0). Thanks for taking the time to turn it into code and run it. – OzgunBu Mar 22 '18 at 19:13