std() groupby Pandas issue

Question

Could this be a bug? When I call describe() or std() on a groupby object, I get different answers:

import pandas as pd
import numpy as np
import random as rnd

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : 1*(np.random.randn(8)>0.5),
                   'D' : np.random.randn(8)})
df.head()

df[['C','D']].groupby(['C'],as_index=False).describe()
# This line gives the standard deviation of 'C' as 0, 0. Within each group the value of C is constant, so that makes sense.

df[['C','D']].groupby(['C'],as_index=False).std()
# This line gives the standard deviation of 'C' as 0, 1. I think this is wrong.

cs95 · Accepted Answer · 2018-03-22T05:18:42.000

It makes sense. In the second case, you only compute the std of column D.

How? That's just how groupby works. You

1. slice on C and D
2. groupby on C
3. call GroupBy.std

At step 3, you did not specify any column, so std is computed on every column that is not the grouper, i.e. column D.

As for why you see C with 0, 1: that's because you specify as_index=False, so the C column is inserted with the group keys taken from the original DataFrame, which in this case are 0 and 1.
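To make the effect of as_index=False concrete, here is a minimal sketch on invented data (the values in C and D are made up for illustration, not taken from the question). It shows that only D is aggregated, and that the C column in the second result holds the group keys rather than a standard deviation.

```python
import pandas as pd

# Invented data: C is the grouping key, D is the only value column.
df = pd.DataFrame({"C": [0, 0, 0, 1, 1],
                   "D": [1.0, 2.0, 3.0, 10.0, 14.0]})

# Default (as_index=True): the group keys become the index, only D is aggregated.
by_index = df.groupby("C").std()

# as_index=False: the same group keys are emitted as an ordinary column named C.
as_column = df.groupby("C", as_index=False).std()

print(by_index)
print(as_column)
```

The D values are identical in both results; the only difference is where the keys 0 and 1 end up.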

Run this and it'll become clear.

df[['C','D']].groupby(['C']).std()

          D
C          
0  0.998201
1       NaN

When you specify as_index=False, the index you see above is inserted as a column. Contrast this with,

df[['C','D']].groupby(['C'])[['C', 'D']].std()

     C         D
C               
0  0.0  0.998201
1  NaN       NaN

Which is exactly what describe gives, and what you're looking for.
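The same contrast can be reproduced on a small deterministic frame. This is only a sketch with invented values; group 1 deliberately has a single row so that its sample standard deviation is undefined, as in the output above.

```python
import pandas as pd

# Invented data: group 0 has three rows, group 1 has a single row.
df = pd.DataFrame({"C": [0, 0, 0, 1],
                   "D": [1.0, 2.0, 3.0, 10.0]})

# Selecting C explicitly makes it a value column in addition to being the grouper,
# so its within-group standard deviation is actually computed.
both = df.groupby("C")[["C", "D"]].std()

print(both)
```

Within group 0 every C value is the same, so its standard deviation is 0.0; group 1 has one observation, so both columns are NaN there.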

Thanks COLDSPEED, I see the problem. Though I still find it visually confusing. Thanks for teaching the trick with index being inserted as the column. — OzgunBu, Mar 22 '18 at 19:12

score 1 · Answer 2 · answered Apr 12 '18 at 18:07

My friend mukherjees and I ran more trials on this and concluded that there really is an issue with std(). In the following link we show that "std() is not the same as .apply(np.std, ddof=1)". After noticing this, we also found the following related bug report:

https://github.com/pandas-dev/pandas/issues/10355
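One way to check this on your own pandas version is to compare the two computations directly. The sketch below uses invented data and compares only the value column D; it does not cover how the grouping column itself is handled. NumPy is applied to the raw array (an assumption made here so the comparison does not route back through pandas' own std).

```python
import numpy as np
import pandas as pd

# Invented data; every group has at least two rows so ddof=1 is well defined.
df = pd.DataFrame({"C": [0, 0, 0, 1, 1],
                   "D": [1.0, 2.0, 4.0, 10.0, 14.0]})

grouped = df.groupby("C")["D"]

# pandas' built-in group standard deviation (ddof=1 by default).
builtin = grouped.std()

# The same statistic computed by NumPy on each group's raw values.
via_apply = grouped.apply(lambda s: np.std(s.to_numpy(), ddof=1))

print(pd.DataFrame({"std()": builtin, "np.std(ddof=1)": via_apply}))
```

If the two columns differ on your installation for a plain value column like D, that would be worth adding to the bug report.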

Aritesh · Answer 3 · 2018-03-22T04:32:42.363

Even with std(), you get a zero standard deviation of C within each group. I just added a seed to your code to make it reproducible. I am not sure what the issue is:

import pandas as pd
import numpy as np
import random as rnd

np.random.seed(1987)  # call seed(); assigning to it would overwrite the function and seed nothing
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
     'foo', 'bar', 'foo', 'foo'],
     'B' : ['one', 'one', 'two', 'three',
     'two', 'two', 'one', 'three'],
     'C' : 1*(np.random.randn(8)>0.5),
     'D' : np.random.randn(8)})
df

df[['C','D']].groupby(['C'],as_index=False).describe()

df[['C','D']].groupby(['C'],as_index=False).std()

To dig deeper: if you look at the source code of describe for groupby, which inherits from DataFrame.describe,

def describe_numeric_1d(series):
    stat_index = (['count', 'mean', 'std', 'min'] +
                  formatted_percentiles + ['max'])
    d = ([series.count(), series.mean(), series.std(), series.min()] +
         [series.quantile(x) for x in percentiles] + [series.max()])
    return pd.Series(d, index=stat_index, name=series.name)

The code above shows that describe simply reports the result of std().
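That relationship can be checked directly. A minimal sketch with invented data, comparing the 'std' column of describe() against std() for the value column D:

```python
import numpy as np
import pandas as pd

# Invented data for illustration.
df = pd.DataFrame({"C": [0, 0, 0, 1, 1],
                   "D": [1.0, 2.0, 4.0, 10.0, 14.0]})

grouped = df.groupby("C")["D"]

# describe() returns one row per group; its 'std' column is the per-group std.
from_describe = grouped.describe()["std"]
from_std = grouped.std()

print(np.allclose(from_describe.to_numpy(), from_std.to_numpy()))
```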

The second row under the C column is exactly what makes me confused (0,1 not 0,0). Thanks for putting the time to turn it into code and run. — OzgunBu, Mar 22 '18 at 19:13
