Consider two DataFrame, A and B, where
- A has columns name, ...,
- B has columns name, aname, ...
One entry in A has any number (0, 1, or more than 1) of corresponding entries in B. These are those where b.aname = a.name for some a in A and b in B.
A could be houses and B people in each house. A house could be empty or some people living there.
I want to do some calculations on table B grouping them using groupby/aggregate and then setting back the results to table A.
Example: Each person has a body height. Calculate the average body height of inhabitants of houses.
B_grouped_agg = B.groupby('aname')['some_property'].agg(['count', 'sum', 'mean'])
and then something like
A['new_aggregated_property'] = B_grouped_agg['mean']
Two problems arise:
- It could happen that the groupby/aggregate list is smaller than A when for some a in A there is no b in B so that b.aname = a . How to default for missing items, that should pass an empty set to agg.
- How to correctly sort the list
B_grouped_agg
? Does it match the order of A (in case the lengths match, so when case 1 does not occur)