
I need to group a Pandas dataframe by date and then take a weighted average of the given values. Here's how it's currently done, using the margin value as an example (it works perfectly until there are NaN values):

import numpy as np

df = orders.copy()
# Create new columns as required
df['margin_WA'] = df['net_margin'].astype(float)    # original data as str or Decimal

def group_wa():
    # weight each group's margins by the matching order amounts
    return lambda num: np.average(num, weights=df.loc[num.index, 'order_amount'])

agg_func = {
    'margin_WA': group_wa(),    # agg_func includes WAs for other elements
}

result = df.groupby('order_date').agg(agg_func)

result['margin_WA'] = result['margin_WA'].astype(str)

In the case where 'net_margin' fields contain NaN values, the WA is set to NaN. I can't seem to dropna() or filter by pd.notnull when creating the new columns, and I don't know where to create a masked array to avoid passing NaN to the group_wa function (as suggested here). How do I ignore NaN in this case?
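
For illustration, here is a small made-up dataframe (same column names as the real data, values invented) that reproduces the behaviour:

import numpy as np
import pandas as pd

orders = pd.DataFrame({
    'order_date':   ['2017-03-01', '2017-03-01', '2017-03-02'],
    'net_margin':   ['0.10', np.nan, '0.30'],      # one missing margin
    'order_amount': [100.0, 50.0, 200.0],
})

df = orders.copy()
df['margin_WA'] = df['net_margin'].astype(float)
agg_func = {'margin_WA':
            lambda num: np.average(num, weights=df.loc[num.index, 'order_amount'])}

print(df.groupby('order_date').agg(agg_func))
#             margin_WA
# order_date
# 2017-03-01        NaN   <- the single NaN margin poisons the whole group
# 2017-03-02        0.3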

1 Answer

I think a simple solution is to drop the missing values before you groupby/aggregate like:

result = df.dropna(subset=['margin_WA']).groupby('order_date').agg(agg_func)

In this case, no rows containing missing values are passed to your group_wa function.
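
Applied to the made-up orders above, this keeps agg_func unchanged (note that it drops the whole row for every aggregated column, which is what the edit below avoids):

result = df.dropna(subset=['margin_WA']).groupby('order_date').agg(agg_func)
#             margin_WA
# order_date
# 2017-03-01        0.1   <- averages only the non-NaN row
# 2017-03-02        0.3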

Edit

Another approach is to move the dropna into your aggregation function, like:

def group_wa(series):
    # drop this group's missing margins, then weight the rest by their order amounts
    dropped = series.dropna()
    return np.average(dropped, weights=df.loc[dropped.index, 'order_amount'])

agg_func = {'margin_WA': group_wa}
result = df.groupby('order_date').agg(agg_func)
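
For instance, extending the made-up data with a second, hypothetical margin column (name and values invented here) that has gaps in different rows, each weighted average uses only its own non-missing rows:

df['gross_margin_WA'] = [np.nan, 0.25, 0.40]    # hypothetical column for illustration

agg_func = {'margin_WA': group_wa, 'gross_margin_WA': group_wa}
result = df.groupby('order_date').agg(agg_func)
#             margin_WA  gross_margin_WA
# order_date
# 2017-03-01        0.1             0.25
# 2017-03-02        0.3             0.40
# each column drops only its own NaNs, so no separate dataframes are needed
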
  • Fair. I was hoping not to because I want to include an order with a NaN net margin for calculating gross margin (non-NaN) for example, so it would require creating different dataframes for different values. Is there a way to avoid that? – user2524282 Mar 27 '17 at 16:13
  • @user2524282 Updated the answer. – pansen Mar 27 '17 at 16:54
  • 2
    Or alternatively, in one line, `group_wa = lambda x: np.average(x.dropna(), weights=df.loc[x.dropna().index, 'order_amount'])` Thank you though; I've been looking for a solution to this problem for a long time, and this is the best solution I've seen! – ladylala Nov 30 '20 at 19:48