4

I have a data frame:

import pandas as pd
import numpy as np

df=pd.DataFrame.from_items([('STAND_ID',[1,1,2,3,3,3]),('Species',['Conifer','Broadleaves','Conifer','Broadleaves','Conifer','Conifer']),
                             ('Height',[20,19,13,24,25,18]),('Stems',[1500,2000,1000,1200,1700,1000]),('Volume',[200,100,300,50,100,10])])

   STAND_ID      Species  Height  Stems  Volume
0         1      Conifer      20   1500     200
1         1  Broadleaves      19   2000     100
2         2      Conifer      13   1000     300
3         3  Broadleaves      24   1200      50
4         3      Conifer      25   1700     100
5         3      Conifer      18   1000      10

I want to group by STAND_ID and Species, apply a weighted mean on Height and Stems with Volume as weight and unstack.

So i try:

newdf=df.groupby(['STAND_ID','Species']).agg({'Height':lambda x: np.average(x['Height'],weights=x['Volume']),
                                              'Stems':lambda x: np.average(x['Stems'],weights=x['Volume'])}).unstack()

Which give me error:

builtins.KeyError: 'Height'

How can i fix this?

BERA
  • 1,345
  • 3
  • 16
  • 36
  • Also seems to be answered here: https://stackoverflow.com/questions/31521027/groupby-weighted-average-and-sum-in-pandas-dataframe – David Faivre Apr 30 '18 at 13:23

1 Answers1

8

Your error is because you can not do multiple series/column operations using agg. Agg takes one series/column as a time. Let's use apply and pd.concat.

g = df.groupby(['STAND_ID','Species'])
newdf = pd.concat([g.apply(lambda x: np.average(x['Height'],weights=x['Volume'])), 
                   g.apply(lambda x: np.average(x['Stems'],weights=x['Volume']))], 
                   axis=1, keys=['Height','Stems']).unstack()

Edit a better solution:

g = df.groupby(['STAND_ID','Species'])
newdf = g.apply(lambda x: pd.Series([np.average(x['Height'], weights=x['Volume']), 
                             np.average(x['Stems'],weights=x['Volume'])], 
                                    index=['Height','Stems'])).unstack()

Output:

              Height                  Stems             
Species  Broadleaves    Conifer Broadleaves      Conifer
STAND_ID                                                
1               19.0  20.000000      2000.0  1500.000000
2                NaN  13.000000         NaN  1000.000000
3               24.0  24.363636      1200.0  1636.363636
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • why do you say that the second solution is better please ? – D.L Jun 25 '21 at 12:11
  • @DL Honestly, I did this answer just 4 years ago, I am no longer convinced the second is any better. I would need to run timings and consider other methods/recipes I know now. – Scott Boston Jun 25 '21 at 13:20
  • 1
    I created two dataframes with `groupby()` and then implemented `pd.join()`, which i presume is most similar to the `concat()` method that you initially asseted. Performance (approximate): It computes 10k rows in 0.1 seconds. – D.L Jun 25 '21 at 13:58