12

I have te following pandas dataframe:

data_df = pd.DataFrame({'ind':['la','p','la','la','p','g','g','la'],
                        'dist':[10.,5.,7.,8.,7.,2.,5.,3.],
                        'diff':[0.54,3.2,8.6,7.2,2.1,1.,3.5,4.5],
                        'cas':[1.,2.,3.,4.,5.,6.,7.,8.]})  

that is

    cas  diff  dist ind
0    1  0.54    10  la
1    2  3.20     5   p
2    3  8.60     7  la
3    4  7.20     8  la
4    5  2.10     7   p
5    6  1.00     2   g
6    7  3.50     5   g
7    8  4.50     3  la

I need to compute the weighted average of all the columns where the weights are in the 'dist' column and group the values by 'ind'.

For example for 'ind'='la' and the 'diff' column:

((10*0.54)+(8.60*7)+(7.20*8)+(4.50*3))/(10+7+8+3) = 4.882143

The result I want to obtain is the following

        cas      diff
ind                    
 g    6.714286  2.785714
 la   3.107143  4.882143
 p    3.750000  2.558333

which is obtained by multiplying each value of each colums by the corrisponding value in the 'dist' column, sum the results with the same 'ind' and then dividing the result by the sum of all the 'dist' values corrisponding to the same ind.

I thought this would have been an easy task done by the dataframe 'groupby' method, but actually it's kind of tricky.

Can someone please help me?

Cecilia
  • 487
  • 3
  • 6
  • 14

1 Answers1

10

You may obtain within groups normalized weights by using transform:

>>> df['weight'] = df['dist'] / df.groupby('ind')['dist'].transform('sum')
>>> df['weight']
0    0.357143
1    0.416667
2    0.250000
3    0.285714
4    0.583333
5    0.285714
6    0.714286
7    0.107143
Name: weight, dtype: float64

Then, you just need to multiply these weight by the values, and take the sum:

>>> df['wcas'], df['wdiff'] = (df[n] * df['weight'] for n in ('cas', 'diff'))
>>> df.groupby('ind')[['wcas', 'wdiff']].sum()
         wcas     wdiff
ind                    
g    6.714286  2.785714
la   3.107143  4.882143
p    3.750000  2.558333

Edit: with in-place mutation:

>>> backup = df.copy()     # make a backup copy to mutate in place
>>> cols = df.columns[:2]  # cas, diff
>>> df[cols] = df['weight'].values[:, None] * df[cols]
>>> df.groupby('ind')[cols].sum()
          cas      diff
ind                    
g    6.714286  2.785714
la   3.107143  4.882143
p    3.750000  2.558333
behzad.nouri
  • 74,723
  • 18
  • 126
  • 124
  • This is actually working! Thank you. The only problem is that the dataframe I wrote it's just an example, I'm working with big data and thousands of columns, so I was looking for a solution where I don't have to write down the columns names... – Cecilia Nov 11 '15 at 19:19
  • @Cecilia you can obtain a list of columns by using `df.columns`, and use that as in the edit – behzad.nouri Nov 11 '15 at 19:38
  • For numerical stability, I think it's better to calculate (X * WT).sum() / WT.sum() instead of X * (WT / WT.sum()) - you're doing one instead of N divisions. – quant_dev Mar 07 '17 at 11:21