3

I have a dataset with weights for each observation and I want to prepare weighted summaries using groupby but am rusty as to how to best do this. I think it implies a custom aggregation function. My issue is how to properly deal with not item-wise data, but group-wise data. Perhaps it means that it is best to do this in steps rather than in one go.

In pseudo-code, I am looking for

#first, calculate weighted value
for each row:
  weighted jobs = weight * jobs
#then, for each city, sum these weights and divide by the count (sum of weights)
for each city:
  sum(weighted jobs)/sum(weight)

I am not sure how to work the "for each city"-part into a custom aggregate function and get access to group-level summaries.

Mock data:

import pandas as pd
import numpy as np
np.random.seed(43)

## prep mock data
N = 100
industry = ['utilities','sales','real estate','finance']
city = ['sf','san mateo','oakland']
weight = np.random.randint(low=5,high=40,size=N)
jobs = np.random.randint(low=1,high=20,size=N)
ind = np.random.choice(industry, N)
cty = np.random.choice(city, N)
df_city =pd.DataFrame({'industry':ind,'city':cty,'weight':weight,'jobs':jobs})
ako
  • 3,569
  • 4
  • 27
  • 38

1 Answers1

7

Simply multiply the two columns:

In [11]: df_city['weighted_jobs'] = df_city['weight'] * df_city['jobs']

Now you can groupby the city (and take the sum):

In [12]: df_city_sums = df_city.groupby('city').sum()

In [13]: df_city_sums
Out[13]: 
           jobs  weight  weighted_jobs
city                                  
oakland     362     690           7958
san mateo   367    1017           9026
sf          253     638           6209

[3 rows x 3 columns]

Now you can divide the two sums, to get the desired result:

In [14]: df_city_sums['weighted_jobs'] / df_city_sums['jobs']
Out[14]: 
city
oakland      21.983425
san mateo    24.594005
sf           24.541502
dtype: float64
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • definitely workable--great! I will be doing the weighting for many variables and would like to wrap into the group by function I could call on the fly; vary grouping levels, etc. Any way this can be done in one step (i.e. in the groupby call, passing a custom function with grouping level, variable to be weighted as arguments?) – ako Jan 22 '14 at 23:53
  • 1
    E.g. you could use groupby apply with `def f(x): return (1. * x['weight'] * x['jobs']).sum() / x['jobs'].sum()` but it will probably be less efficient than the above. – Andy Hayden Jan 23 '14 at 00:09