I want to create a group column where each group's sum of `area` is as equal as possible across the groups. The row order must not change.
import pandas as pd
import numpy as np
data = {"id":[1,2,3,4,5,6,7,8,9,10],
"area":[489.8,1099,1004,37.2,371.3,2390.5,2500,2500,1298.7,1125.7]}
df = pd.DataFrame(data)
id area
0 1 489.8
1 2 1099.0
2 3 1004.0
3 4 37.2
4 5 371.3
5 6 2390.5
6 7 2500.0
7 8 2500.0
8 9 1298.7
9 10 1125.7
numgroups = 3  # I want this number of groups
group_area_target = df.area.sum() / numgroups
# ~4272.07
My goal is to create the group column:
#df["group"] = [1,1,1,1,1,2,2,3,3,3]
# id area group
# 0 1 489.8 1
# 1 2 1099.0 1
# 2 3 1004.0 1
# 3 4 37.2 1
# 4 5 371.3 1
# 5 6 2390.5 2
# 6 7 2500.0 2
# 7 8 2500.0 3
# 8 9 1298.7 3
# 9 10 1125.7 3
#sums = df.groupby("group")["area"].sum()
# group
# 1 3001.3
# 2 4890.5
# 3 4924.4
#The total deviation from the target is (sums - group_area_target).abs().sum() ≈ 2541.5
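Putting the verification above into a self-contained snippet (same data and hand-made group column as above):

```python
import pandas as pd

# Rebuild the example DataFrame and target from the question.
df = pd.DataFrame({
    "id": range(1, 11),
    "area": [489.8, 1099, 1004, 37.2, 371.3, 2390.5, 2500, 2500, 1298.7, 1125.7],
})
numgroups = 3
group_area_target = df["area"].sum() / numgroups  # ~4272.07

# Manual group assignment, row order unchanged.
df["group"] = [1, 1, 1, 1, 1, 2, 2, 3, 3, 3]

# Per-group sums and total absolute deviation from the target.
sums = df.groupby("group")["area"].sum()
total_dev = (sums - group_area_target).abs().sum()
print(sums)       # 3001.3, 4890.5, 4924.4
print(total_dev)  # ~2541.5
```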
I found this question and answer which roughly suggests:
a = df.area.values
shift_num = group_area_target * np.arange(1, numgroups)
idx = np.searchsorted(a.cumsum(), shift_num, "right").tolist()
for e, i in enumerate(idx, 1):
    df.loc[i, "group"] = e
df["group"] = df["group"].bfill().fillna(df["group"].max() + 1)
sums2 = df.groupby("group")["area"].sum()
id area group
0 1 489.8 1.0
1 2 1099.0 1.0
2 3 1004.0 1.0
3 4 37.2 1.0
4 5 371.3 1.0
5 6 2390.5 1.0
6 7 2500.0 2.0
7 8 2500.0 2.0
8 9 1298.7 3.0
9 10 1125.7 3.0
#The total deviation is (sums2 - group_area_target).abs().sum() ≈ 3695.3
So this result is worse than the manual assignment of groups. How can I automate finding the grouping that minimizes the total deviation?
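For reference, one brute-force approach (my own sketch, not from the linked answer) is to enumerate every placement of the `numgroups - 1` break points with `itertools.combinations` and keep the split with the smallest total absolute deviation. This is exact but only practical for small inputs, since the number of splits grows combinatorially:

```python
import itertools

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 11),
    "area": [489.8, 1099, 1004, 37.2, 371.3, 2390.5, 2500, 2500, 1298.7, 1125.7],
})
numgroups = 3
target = df["area"].sum() / numgroups

n = len(df)
best_dev, best_groups = np.inf, None
# Try every way to cut the row sequence into numgroups contiguous pieces.
for cuts in itertools.combinations(range(1, n), numgroups - 1):
    bounds = (0, *cuts, n)
    sums = [df["area"].iloc[b:e].sum() for b, e in zip(bounds, bounds[1:])]
    dev = sum(abs(s - target) for s in sums)
    if dev < best_dev:
        # Expand the cut positions into a group label per row.
        best_dev = dev
        best_groups = np.repeat(np.arange(1, numgroups + 1), np.diff(bounds))

df["group"] = best_groups
print(df)
print(best_dev)  # ~2541.5, matching the manual assignment
```

On this data the search recovers the hand-made grouping `[1,1,1,1,1,2,2,3,3,3]`.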