Consider this DataFrame given with many colums, but it has a feature defined in the column 'feature'
and some values in the column 'values'
.
I want in an extra column the relative values per feature (group) The desired result is manually precalculated by me in the column 'desired'
df = pd.DataFrame(
data={
'feature': [1, 1, 2, 3, 3, 3],
'values': [30.0, 20.0, 25.0, 100.0, 250.0, 50.0],
'desired': [0.6, 0.4, 1.0, 0.25, 0.625, 0.125],
'more_columns': range(6),
},
)
Which leads to the DataFrame
feature values desired more_columns
0 1 30.0 0.600 0
1 1 20.0 0.400 1
2 2 25.0 1.000 2
3 3 100.0 0.250 3
4 3 250.0 0.625 4
5 3 50.0 0.125 5
So for the group defined by feature 1
the desired values are 0.6 and 0.4 (because 0.6 = 30 / (20+30)
) and so on.
I came to these values manually using
for feature, group in df.groupby('feature'):
rel_values = (group['values'] / group['values'].sum()).values
df[df['feature'] == feature]['result'] = rel_values # no effect
print(f'{feature}: {rel_values}')
# which prints:
1: [0.6 0.4]
2: [1.]
3: [0.25 0.625 0.125]
# but df remains unchanged
I believe that there must be a smart and fast way in pandas to accomplish this.