
I am trying to do descriptive statistics of a DataFrame using GroupBy, and put those values back into the DataFrame.

My DataFrame contains a non-unique running number which identifies a person (anonymously), and some values connected to each person.

E.g.:

RunNr    Value
1        126
1        158
1        18
2        65
3        31   
3        4
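
For reference, the same data as a DataFrame (a minimal sketch):

import pandas as pd

df = pd.DataFrame({'RunNr': [1, 1, 1, 2, 3, 3],
                   'Value': [126, 158, 18, 65, 31, 4]})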

By using GroupBy I can calculate descriptive statistics for each person (running number), such as the standard deviation. I want to add these back into the DataFrame for further processing (like making a report in Word).

The result should look like this:

RunNr    Value    Std
1        126      59,9
1        158      59,9
1        18       59,9
2        65       NaN
3        31       13,5
3        4        13,5

The best solution I have come up with is to calculate the standard deviation (and other statistics) and put these into a dictionary where the running number is the key and the standard deviation is the value.

I now have a dictionary where the running number is a unique key, while it is not unique in the DataFrame. My next step is to iterate over the dictionary and use .loc to insert the corresponding value into the correct rows:

for key, value in self.dict_of_std:
    self.internal_main_df.loc[self.internal_main_df.Fnr == key] = value

I am getting this error:

TypeError: cannot unpack non-iterable float object
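
(For reference: iterating over a dict directly yields only its keys, which is why the two-name unpacking fails; a sketch of a version that would at least run, assuming dict_of_std maps each running number to its standard deviation, is:)

for key, value in self.dict_of_std.items():
    # assign only into a 'Std' column, not into the whole matching rows
    self.internal_main_df.loc[self.internal_main_df.Fnr == key, 'Std'] = value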

Suggestions to improve my code, or my overall approach, are appreciated.

1 Answer


If you need one column filled with the std per group, use GroupBy.transform with one aggregate function, here std:

df['Std'] = df.groupby('RunNr')['Value'].transform('std')
print (df)
   RunNr  Value        Std
0      1    126  73.357572
1      1    158  73.357572
2      1     18  73.357572
3      2     65        NaN
4      3     31  19.091883
5      3      4  19.091883
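
Note that the expected output in the question (59,9 and 13,5) matches the population standard deviation, while pandas' std defaults to the sample standard deviation (ddof=1). If the population version is wanted, a lambda can be passed to transform (a sketch; single-row groups then get 0.0 instead of NaN):

df['Std'] = df.groupby('RunNr')['Value'].transform(lambda s: s.std(ddof=0))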

If you need more statistics, it is possible to use DataFrameGroupBy.describe with DataFrame.join for the new columns:

df1 = df.join(df.groupby('RunNr')['Value'].describe(), on='RunNr')
print (df1)
   RunNr  Value  count        mean        std   min    25%    50%     75%  \
0      1    126    3.0  100.666667  73.357572  18.0  72.00  126.0  142.00   
1      1    158    3.0  100.666667  73.357572  18.0  72.00  126.0  142.00   
2      1     18    3.0  100.666667  73.357572  18.0  72.00  126.0  142.00   
3      2     65    1.0   65.000000        NaN  65.0  65.00   65.0   65.00   
4      3     31    2.0   17.500000  19.091883   4.0  10.75   17.5   24.25   
5      3      4    2.0   17.500000  19.091883   4.0  10.75   17.5   24.25   

     max  
0  158.0  
1  158.0  
2  158.0  
3   65.0  
4   31.0  
5   31.0   

Or it is possible to specify the aggregate functions in GroupBy.agg:

df2 = df.join(df.groupby('RunNr')['Value'].agg(['mean','max','std']), on='RunNr')
print (df2)
   RunNr  Value        mean  max        std
0      1    126  100.666667  158  73.357572
1      1    158  100.666667  158  73.357572
2      1     18  100.666667  158  73.357572
3      2     65   65.000000   65        NaN
4      3     31   17.500000   31  19.091883
5      3      4   17.500000   31  19.091883
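
If clearer column names are wanted in the joined result, named aggregation (pandas 0.25+) can also be used; a sketch, where val_mean and val_std are just example names:

df3 = df.join(df.groupby('RunNr')['Value'].agg(val_mean='mean', val_std='std'), on='RunNr')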
jezrael
  • Wow, that was fast. I tried running your code, and I got: AttributeError: 'SeriesGroupBy' object has no attribute 'Std' – CenturionNOR Jun 18 '19 at 07:52
  • @CenturionNOR - is it not a typo, `Std` instead of `std`? – jezrael Jun 18 '19 at 07:53
  • Also, I am trying to understand the code. As far as I can see, when I run groupby I create a new DataFrame object (as opposed to a groupby object, etc.) and can run the same methods as on a regular DataFrame; in this case transform() [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html) – CenturionNOR Jun 18 '19 at 07:56
  • @CenturionNOR - `transform` works like aggregating the data with `.agg` and then creating a new column filled with the aggregate values (a short sketch contrasting the two follows after these comments). More information about aggregation is [here](https://stackoverflow.com/questions/53781634/aggregation-in-pandas) – jezrael Jun 18 '19 at 07:58
  • This worked like magic! Thank you for taking the time to elaborate the answer and not just posting the code; this helps me understand and become a better programmer. – CenturionNOR Jun 18 '19 at 08:00
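
A minimal sketch contrasting the two, using df from above:

# agg returns one value per group, indexed by RunNr
print (df.groupby('RunNr')['Value'].agg('std'))

# transform broadcasts each group's result back onto the original row index,
# which is why it can be assigned directly as a new column
print (df.groupby('RunNr')['Value'].transform('std'))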