I recently ran into an interesting question about the difference between .agg() and .apply() in pandas groupby(). I read the great post "Pandas difference between apply() and aggregate() functions".
It clarified a lot, but I am still a bit confused about when to use .agg() and when to use .apply().
Demo dataset:
import pandas as pd
import numpy as np
df_min = pd.DataFrame({"A":[0.0,0.0,np.nan,0.0,0.42832,np.nan,0.62747,0.69856],
"B":[0.42832,0.69856,0.75865,0.42832,0.62747,0.27024,0.42832,np.nan],
"C":[0,0,1,0,0,0,0,0]})
         A        B  C
0  0.00000  0.42832  0
1  0.00000  0.69856  0
2      NaN  0.75865  1
3  0.00000  0.42832  0
4  0.42832  0.62747  0
5      NaN  0.27024  0
6  0.62747  0.42832  0
7  0.69856      NaN  0
The objective: fill the np.nan values with the group mean via a groupby statement.
My code is listed below:
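# Fill each column's NaN values with that column's mean within the group
fill_na = lambda x: x.fillna(x.mean())
# Group by column 'C' (the demo dataset has no 'transportation_issues' column)
df_min.groupby('C').apply(fill_na)
df_min.groupby('C').agg(fill_na)  # raises the ValueError shown below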
When I used .apply(), the code did its job and returned the result. But when I used .agg(), a ValueError occurred:
ValueError: Shape of passed values is (3, 2), indices imply (2, 2)
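To make the shape mismatch concrete, here is a minimal sketch. It assumes column C of the demo dataset is the grouping key (the frame has no transportation_issues column), and the names lengths, means and filled are only illustrative. It checks what fill_na returns per group, compares that with a true aggregation, and tries .transform() as one shape-preserving option. It is an illustration of the observable behaviour, not a description of pandas internals.

```python
import numpy as np
import pandas as pd

df_min = pd.DataFrame({
    "A": [0.0, 0.0, np.nan, 0.0, 0.42832, np.nan, 0.62747, 0.69856],
    "B": [0.42832, 0.69856, 0.75865, 0.42832, 0.62747, 0.27024, 0.42832, np.nan],
    "C": [0, 0, 1, 0, 0, 0, 0, 0],
})

fill_na = lambda x: x.fillna(x.mean())

# fill_na returns a Series as long as the group it receives, not a single value.
lengths = {}
for key, group in df_min.groupby("C"):
    out = fill_na(group["A"])
    lengths[key] = (len(group), len(out))
print(lengths)

# A true aggregation reduces every group to one value per column,
# so the result has one row per group.
means = df_min.groupby("C").mean()
print(means.shape)

# transform() expects a result shaped like its input, so a fill function fits here.
filled = df_min.groupby("C")[["A", "B"]].transform(fill_na)
print(filled)
```

Note that row 2 of column A stays NaN in this sketch, because its group (C == 1) has no non-missing A value to average.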
So, my questions are:
1: Why did .agg() not work?
2: What should I do to make the user-defined function work with .agg()?
3: When applying a user-defined function on groupby(), when should I use .apply() and when should I use .agg()?
4: In groupby(), is it true that .apply() operates on the whole dataset and .agg() operates on the columns?
5: Under the hood, how do .apply() and .agg() differ from each other?
Thank you so much for answering my question; I very much appreciate your help!