3

Given the following dataframe:

import pandas as pd
d=pd.DataFrame({'Age':[18,20,20,56,56],'Race':['A','A','A','B','B'],'Response':[3,2,5,6,2],'Weight':[0.5,0.5,0.5,1.2,1.2]})
d
    Age     Race    Response    Weight
0   18      A       3           0.5
1   20      A       2           0.5
2   20      A       5           0.5
3   56      B       6           1.2
4   56      B       2           1.2

I know that I can apply a group-by to get the count by age and race like this:

d.groupby(['Age','Race'])['Response'].count()
Age  Race
18   A       1
20   A       2
56   B       2
Name: Response, dtype: int64

But I'd like to use the "Weight" column to weight the cases such that the first 3 rows will count as 0.5 instead of 1 each and the last two will count as 1.2. So, if grouping by age and race, I should have the following:

Age  Race
18   A       0.5
20   A       1
56   B       2.4
Name: Response, dtype: int64

This is similar to using the "Weight Cases" option in SPSS. I know it's possible in R and I've seen a promising library in Python (though the current build is failing) here.

And PySal (not sure if it's applicable here)

...but I'm wondering if it can just be done somehow in the group-by.

double-beep
  • 5,031
  • 17
  • 33
  • 41
Dance Party
  • 3,459
  • 10
  • 42
  • 67

2 Answers2

3

If I understand correctly, you're just looking for .sum() with the weights.

d.groupby(['Age', 'Race']).Weight.sum()

## Age  Race
## 18   A       0.5
## 20   A       1.0
## 56   B       2.4
## Name: Weight, dtype: float64
miradulo
  • 28,857
  • 6
  • 80
  • 93
2

Not exactly obvious... but a fun way to do it with pd.factorize and np.bincount

f, u = pd.factorize(
    [tuple(row) for row in d[['Age', 'Race']].values.tolist()]
)

pd.Series(
    np.bincount(f, d.Weight.values),
    pd.MultiIndex.from_tuples(u, names=['Age', 'Race'])
)

Age  Race
18   A       0.5
20   A       1.0
56   B       2.4
dtype: float64
piRSquared
  • 285,575
  • 57
  • 475
  • 624