
I'm using a lambda function in a pandas aggregation to calculate a weighted average. My issue is that if any value in a group is NaN, the result for that whole group is NaN. How can I avoid this?

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df['four'] = 'bar'
df['five'] = df['one'] > 0
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df.loc['b','four'] ='foo'
df.loc['c','four'] ='foo'


        one       two     three four   five
a  1.046540 -0.304646 -0.982008  bar   True
b       NaN       NaN       NaN  foo    NaN
c -1.086525  1.086501  0.403910  foo  False
d       NaN       NaN       NaN  NaN    NaN
e  0.569420  0.105422  0.192559  bar   True
f  0.384400 -0.558321  0.324624  bar   True
g       NaN       NaN       NaN  NaN    NaN
h  0.656231 -2.185062  0.180535  bar   True


df.groupby('four').agg(
    sum=('two', 'sum'),
    weighted_avg=('one', lambda x: np.average(x, weights=df.loc[x.index, 'two'])),
)

           sum  weighted_avg
four                        
bar  -2.942608      0.648173
foo   1.086501           NaN

desired result:

           sum  weighted_avg
four                        
bar  -2.942608      0.648173
foo   1.086501     -1.086525 

Unlike this question, the problem is not that the actual value of the column fails to appear; the problem is that `nanmean` has no weighting option.

Another numerical example:

     x      y
0   NaN   18.0
1   NaN   21.0
2   NaN   38.0
3  56.0  150.0
4  65.0  154.0

Here we would want to return the weighted average of just the last two rows and ignore the rows that contain NaN.
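For this second example, dropping the NaN rows before calling `np.average` gives the desired result. A minimal sketch (the frame is rebuilt here as `df2`, a name not in the original):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'x': [np.nan, np.nan, np.nan, 56.0, 65.0],
                    'y': [18.0, 21.0, 38.0, 150.0, 154.0]})

# keep only rows where both the value and its weight are present
mask = df2['x'].notna() & df2['y'].notna()
result = np.average(df2.loc[mask, 'x'], weights=df2.loc[mask, 'y'])
print(result)  # ≈ 60.5592, the weighted average of the last two rows
```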

Nickpick

2 Answers


This solution works for me:

def f(x):
    # keep only the non-NaN values and their matching weights
    indices = ~np.isnan(x)
    return np.average(x[indices], weights=df.loc[x.index[indices], 'two'])

df = df.groupby('four').agg(sum=('two','sum'), weighted_avg=('one', f))

print (df)
           sum  weighted_avg
four                        
bar  -2.942607      0.648173
foo   1.086501     -1.086525

EDIT:

def f(x):
    indices = ~np.isnan(x)
    # only compute the weighted average if the group has no NaNs at all;
    # otherwise return NaN for the whole group
    if indices.all():
        return np.average(x[indices], weights=df.loc[x.index[indices], 'two'])
    else:
        return np.nan
jezrael
  • that looks great. Any way to avoid `ZeroDivisionError` from zero weights? – Nickpick Jun 18 '20 at 08:09
  • I think it would be better to ensure that the weights have at least a value of 1e-7. Can this be achieved with a max? Also, can it all be done in a lambda? `df.groupby('four').agg(sum=('two','sum'), weighted_avg=('one', lambda x: np.average(x[~np.isnan(x)], weights=df.loc[x.index[~np.isnan(x)], 'two'])))`. Where would that `max(weight, 1e-7)` have to go? – Nickpick Jun 18 '20 at 08:23
  • @Nickpick - I use a custom function to avoid testing for missing values twice ;) The second point needs some testing. – jezrael Jun 18 '20 at 08:30
  • @Nickpick - For the second point, I tested the sample data after changing the first `foo` to `foo1` in the `four` column. If a group has no values to aggregate, `x[indices]` is an empty Series and `x.index[indices]` an empty index, so the weights cannot be normalized. Maybe my test data is wrong, please check it. – jezrael Jun 18 '20 at 08:36
  • the problem with this solution is that it returns NaN if there are any NaN values, instead of just ignoring them. How can this be achieved? How can we avoid the `return np.nan`? Why would `indices.all()` still not be true after `~np.isnan(x)`? – Nickpick Jun 19 '20 at 09:26
  • @Nickpick - Not sure I understand; could you change the data sample so it can be tested? It seems my change of the first `foo` to `foo1` is not what you need. – jezrael Jun 19 '20 at 09:29
  • added another numerical example. It has some NaNs, so `indices.all()` will be false and it will just return `np.nan` instead of the weighted average of the last two rows. – Nickpick Jun 19 '20 at 09:37
  • @Nickpick - what are groups in added data? – jezrael Jun 19 '20 at 10:09
  • I proposed a slightly improved solution, see my separate answer. It is based on your original answer, thanks again! – Nickpick Jun 19 '20 at 17:37

This appears to be more robust:

weight_column = 'two'

def f(x):
    # keep rows where both the value and its weight are non-NaN
    indices = (~np.isnan(x)) & (~np.isnan(df[weight_column]))[x.index]
    try:
        return np.average(x[indices], weights=df.loc[x.index[indices], weight_column])
    except ZeroDivisionError:
        # all weights for this group are missing or sum to zero
        return np.nan

df = df.groupby('four').agg(sum=('two','sum'), weighted_avg=('one', f))
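As a quick sanity check of the function above, here is a self-contained sketch on a small hand-made frame (the frame `demo` and the `baz` group are illustrative assumptions, not from the original):

```python
import numpy as np
import pandas as pd

weight_column = 'two'  # the weight column used throughout the question

demo = pd.DataFrame({'one':  [1.0, np.nan, -1.0, 2.0],
                     'two':  [2.0, np.nan,  3.0, 0.0],
                     'four': ['bar', 'foo', 'foo', 'baz']})

def f(x):
    # keep rows where both the value and its weight are non-NaN
    indices = (~np.isnan(x)) & (~np.isnan(demo[weight_column]))[x.index]
    try:
        return np.average(x[indices], weights=demo.loc[x.index[indices], weight_column])
    except ZeroDivisionError:
        # weights sum to zero, or the group is empty after masking
        return np.nan

out = demo.groupby('four').agg(sum=('two', 'sum'),
                               weighted_avg=('one', f))
# 'foo' ignores its NaN row; 'baz' has zero total weight and stays NaN
print(out)
```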
Nickpick