Applying custom functions to groupby objects pandas

Question

I have the following pandas dataframe.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "bird_type": ["falcon", "crane", "crane", "falcon"],
        "avg_speed": [np.random.randint(50, 200) for _ in range(4)],
        "no_of_birds_observed": [np.random.randint(3, 10) for _ in range(4)],
        "reliability_of_data": [np.random.rand() for _ in range(4)],
    }
)

# The dataframe looks like this. 
   bird_type    avg_speed   no_of_birds_observed    reliability_of_data
0   falcon        66            3                       0.553841
1   crane         159           8                       0.472359
2   crane         158           7                       0.493193
3   falcon        161           7                       0.585865

Now, I would like to have the weighted average (according to the number_of_birds_surveyed) for the average_speed and reliability variables. For that I have a simple function as follows, which calculates the weighted average.

def func(data, numbers):
    ans = 0
    for a, b in zip(data, numbers):
        ans = ans + a*b
    ans = ans / sum(numbers)
    return ans

How can I apply the function of func to both average speed and reliability variables?

I expect the answer to be a dataframe like follows

    bird_type   avg_speed        no_of_birds_observed  reliability_of_data
0   falcon      132.5                 10                   0.5762578   
# how       (66*3 + 161*7)/(3+7)    (3+10)     (0.553841×3+0.585865×7)/(3+7)
1   crane       158.53                15                   0.4820815
# how      (159*8 + 158*7)/(8+7)    (8+7)     (0.472359×8+0.493193×7)/(8+7)

I saw this question, but could not generalize the solution / understand it completely. I thought of not asking the question, but according to this blog post by SO and this meta question, with a different example, I think this question can be considered a "borderline duplicate". An answer will benefit me and probably some others will also find this useful. So finally decided to ask.

What are variables `data, numbers` in function? It is mapped some columns? — jezrael, Jan 23 '23 at 10:41
`data` is a list of numbers and `numbers` is a list of corresponding frequency values @jezrael. In my case, for average speed calculation, i expect `data = [66, 161]` and `numbers = [3, 7]` for bird_type falcon and `data = [159, 158]` and `numbers = [8,7]` for bird_type crane — berinaniesh, Jan 23 '23 at 10:42
`data` are numbers from `avg_speed` ? `numbers` are from `reliability_of_data` ? — jezrael, Jan 23 '23 at 10:42
For reliability calculation, for falcon, `data = [0.553841, 0.585865]` and `numbers = [3, 7]` and for crane, `data = [0.472359, 0.493193]` and `numbers = [8,7]` — berinaniesh, Jan 23 '23 at 10:48
There was a small error in `func`. Just now corrected it. @jezrael — berinaniesh, Jan 23 '23 at 11:40

score 2 · Answer 1 · answered Jan 23 '23 at 12:06

2

Don't use a function with apply, rather perform a classical aggregation:

cols = ['avg_speed', 'reliability_of_data']

# multiply relevant columns by no_of_birds_observed
# aggregate everything as sum
out = (df[cols].mul(df['no_of_birds_observed'], axis=0)
       .combine_first(df)
       .groupby('bird_type').sum()
      )

# divide the relevant columns by the sum of no_of_birds_observed
out[cols] = out[cols].div(out['no_of_birds_observed'], axis=0)

Output:

            avg_speed  no_of_birds_observed  reliability_of_data
bird_type                                                       
crane      158.533333                    15             0.482082
falcon     132.500000                    10             0.576258

answered Jan 23 '23 at 12:06

mozway

194,879
13
39
75

This indeed looks simpler. Should I accept this answer? – berinaniesh Jan 23 '23 at 12:11
Why does the order change? Due to the alphabetical order of the bird names? – berinaniesh Jan 23 '23 at 12:13
1

@berinaniesh do as you wish, I believe this is more straightforward, but it's your choice ultimately. If you want to keep the original order, add `sort=False` in the `groupby` – mozway Jan 23 '23 at 12:13
I think I am going to accept @jezrael's answer because, this is not my actual dataset. I am not going to take weighted average and I have to use other functions. That answer is more generalized and I can use it for my actual data. But strictly as far as this question is concerned, I do think this answer is simpler. – berinaniesh Jan 23 '23 at 12:52
Always though SO to be objective, unopinionated, etc. Laughs on me I guess. To anyone reading this, I have nothing personal against or for towards anyone for me to accept one answer over another. As stated above, I think @jezrael's answer to be more generic and usable to my case (with the lambda function). Would have been happier if it were a normal function, but that solution is more generic. – berinaniesh Jan 23 '23 at 13:00
your comment on how to prevent sorting seems to have disappeared, can you add it back? – berinaniesh Jan 23 '23 at 13:37
@berinaniesh I believe it's still here, see my first comment – mozway Jan 23 '23 at 15:08

jezrael · Accepted Answer · 2023-01-23T12:02:07.937

0

If want aggregate by GroupBy.agg for weights parameter is used no_of_birds_observed by DataFrame.loc:

#for correct ouput need default (or unique values) index
df = df.reset_index(drop=True)


f = lambda x: np.average(x,  weights= df.loc[x.index, 'no_of_birds_observed'])
df1 = (df.groupby('bird_type', sort=False, as_index=False)
          .agg(avg=('avg_speed',f),
               no_of_birds=('no_of_birds_observed','sum'),
               reliability_of_data=('reliability_of_data', f)))
print (df1)
  bird_type         avg  no_of_birds  reliability_of_data
0    falcon  132.500000           10             0.576258
1     crane  158.533333           15             0.482082

edited Jan 23 '23 at 12:02

answered Jan 23 '23 at 10:39

jezrael

822,522
95
1,334
1,252

@berinaniesh - So it means your function is wrong? – jezrael Jan 23 '23 at 11:34
Probably, in the expected answer, I have mentioned how we arrive at the values. If it is wrong, can you please give me the correct solution? – berinaniesh Jan 23 '23 at 11:36
@berinaniesh - Can you check now? – jezrael Jan 23 '23 at 11:45
There is something still wrong with the weighted average of speed. Falcon's speed should be 132.5 `(66*3+161*7)/10` We get crane's speed approximately right because both values are 158 and 159 and the weights are 8 and 7. @jezrael – berinaniesh Jan 23 '23 at 11:57
@berinaniesh - thank you for patience, now it is correct. – jezrael Jan 23 '23 at 12:03
The weighted average of "reliability_of_data" is on exactly right. Can you do the same for "bird_type" as well. @jezrael. – berinaniesh Jan 23 '23 at 12:03
1

You used `agg`, and lambda functions, passed the lambda function as an argument, etc. @mozway multiplied values and divided values. For me, it was easier to understand. And always, we ourselves can understand our code better than anyone else's. With your expertise and experience, I guess you can relate. I really don't have a preference for any one answer or any one person. It is all my opinion. You have answered many of my questions, I remember. I'm always grateful to you. I think we can move on. – berinaniesh Jan 23 '23 at 12:29

Applying custom functions to groupby objects pandas

2 Answers2