
I am trying to compute a weighted standard deviation on top of a weighted average for my pandas DataFrame. I have a DataFrame like:

import numpy as np
import pandas as pd
df = pd.DataFrame({"Date": pd.date_range(start='2018-01-01', end='2018-01-03 18:00:00', freq='6H'),
               "Weight": np.random.uniform(3, 5, 12),
               "V1": np.random.uniform(10, 15, 12),
               "V2": np.random.uniform(10, 15, 12),
               "V3": np.random.uniform(10, 15, 12)})

Currently, to get the weighted mean, inspired by this post, I am doing the following:

def weighted_average_std(grp):
    return grp._get_numeric_data().multiply(grp['Weight'], axis=0).sum()/grp['Weight'].sum()
df.index = df["Date"]
df_agg = df.groupby(pd.Grouper(freq='1D')).apply(weighted_average_std).reset_index()
df_agg

This gives the following:

        Date         V1         V2         V3    Weight
0 2018-01-01  11.421749  13.090178  11.639424  3.630196
1 2018-01-02  12.142917  11.605284  12.187473  4.056303
2 2018-01-03  12.034015  13.159132  11.658969  4.318753

I want to modify weighted_average_std so that it returns the weighted standard deviation for each column in addition to the weighted average. The idea is to reuse the weighted average of each group in a vectorized fashion. The new column names for the weighted standard deviation can be something like V1_WSD, V2_WSD, and V3_WSD.

PS1: This post goes through the theory of weighted standard deviation.
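
For reference, the weighted mean and the (biased) weighted standard deviation discussed there are

$$\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}, \qquad \sigma_w = \sqrt{\frac{\sum_i w_i\,(x_i - \bar{x}_w)^2}{\sum_i w_i}},$$

which is exactly what the answer below computes.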

PS2: Column Weight in df_agg is meaningless.

ahoosh

1 Answer


You could use EOL's NumPy-based code to calculate weighted averages and standard deviation. To use this in a Pandas groupby/apply operation, make weighted_average_std return a DataFrame:

import numpy as np
import pandas as pd


def weighted_average_std(grp):
    """
    Weighted mean and (biased) weighted standard deviation per column.
    Based on http://stackoverflow.com/a/2415343/190597 (EOL)
    """
    tmp = grp.select_dtypes(include=[np.number])
    weights = tmp['Weight']
    values = tmp.drop('Weight', axis=1)
    # np.ma.average (unlike np.average) tolerates groups whose weights
    # sum to zero, returning masked values instead of raising an error
    average = np.ma.average(values, weights=weights, axis=0)
    # weighted variance: sum_i w_i * (x_i - mean_w)^2 / sum_i w_i
    variance = np.dot(weights, (values - average) ** 2) / weights.sum()
    std = np.sqrt(variance)
    # one row per value column, so groupby/apply plus unstack yields a
    # (mean/std, column) MultiIndex on the columns
    return pd.DataFrame({'mean': average, 'std': std}, index=values.columns)

np.random.seed(0)
df = pd.DataFrame({
    "Date": pd.date_range(start='2018-01-01', end='2018-01-03 18:00:00', freq='6H'),
    "Weight": np.random.uniform(3, 5, 12),
    "V1": np.random.uniform(10, 15, 12),
    "V2": np.random.uniform(10, 15, 12),
    "V3": np.random.uniform(10, 15, 12)})

df.index = df["Date"]
df_agg = df.groupby(pd.Grouper(freq='1D')).apply(weighted_average_std).unstack(-1)
print(df_agg)

yields

                 mean                             std                    
                   V1         V2         V3        V1        V2        V3
Date                                                                     
2018-01-01  12.105253  12.314079  13.566136  1.803014  1.725761  0.679279
2018-01-02  13.223172  12.534893  11.860456  1.709583  0.950338  1.153895
2018-01-03  13.782625  12.013557  12.105231  0.969099  1.189149  1.249064
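
Because df_agg keeps a (statistic, column) MultiIndex on its columns, selections come out naturally. For example (a quick sketch against the df_agg built above):

df_agg['mean']                    # just the weighted means of V1, V2, V3
df_agg['std']                     # just the weighted standard deviations
df_agg.xs('V1', axis=1, level=1)  # mean and std for V1 only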
unutbu
  • This works great. Can you explain what `unstack(-1)` does? – ahoosh Jan 17 '18 at 19:05
  • 1
    The best way to understand [`unstack`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.unstack.html) is to play with examples (such as the one shown on the linked page, or perhaps the one shown here). One way to think of it is to concentrate on how it affects the row and column indexes -- `unstack` moves a row index level to a new column index level. The data is reshaped as necessary. The `-1` tells `unstack` to move the last row index level -- the `V1`, `V2`, `V3` values, not the `Date`s. – unutbu Jan 17 '18 at 19:09
  • That makes sense. I am using the following line to flatten the column names into `V1_mean`- and `V1_std`-style names: `df_agg.columns = [col[1] + "_" + col[0] if (col[0] != "" and col[1] != "") else col[1] + col[0] for col in df_agg.columns.values]`. Please let us know if there is a more elegant way to do it. – ahoosh Jan 17 '18 at 19:26
  • 1
    That works, or you could use something like `df_agg.columns = df_agg.swaplevel(axis=1).columns.map('_'.join)`. Note that keeping the [MultiIndex](https://pandas.pydata.org/pandas-docs/stable/advanced.html) can be advantageous, especially if you wish to, say, select only the means, or only the stds, or only the data associated with `V1`, for example. – unutbu Jan 17 '18 at 19:32
  • In my data, some days have no rows at all. In that case I get the error `Weights sum to zero, can't be normalized`. The error happens inside `np.average` because `pd.Grouper` creates filler groups for days with no data. Is there a way to force `Grouper` not to do that, so we don't get an error? – ahoosh Jan 17 '18 at 20:29
  • 1
    Change [`np.average`](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.average.html) to `np.ma.average`. The linked doc page points to the answer (search for `ZeroDivisionError`). With this change, `df_agg` will have rows with `NaNs`. If you wish to drop these rows, use `df_agg = df_agg.dropna()`. – unutbu Jan 17 '18 at 20:34
  • Sweet. It was fixed. I just get a warning but everything else works great. – ahoosh Jan 17 '18 at 20:39
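
Putting the comment thread together, a minimal sketch (continuing from the df_agg built in the answer; the flat `V1_mean` / `V1_std` names are just one possible convention):

# flatten the (statistic, column) MultiIndex into V1_mean / V1_std style names
df_agg.columns = df_agg.swaplevel(axis=1).columns.map('_'.join)

# days with no rows at all become all-NaN aggregate rows once np.ma.average
# masks the zero-weight groups; drop them if they are unwanted
df_agg = df_agg.dropna()

print(df_agg.columns.tolist())
# ['V1_mean', 'V2_mean', 'V3_mean', 'V1_std', 'V2_std', 'V3_std']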