
I want to create a group column where each group's cumulative sum is as equal as possible across the groups. The row order must not change.

import pandas as pd
import numpy as np

data = {"id":[1,2,3,4,5,6,7,8,9,10],
        "area":[489.8,1099,1004,37.2,371.3,2390.5,2500,2500,1298.7,1125.7]}
df = pd.DataFrame(data)

   id    area
0   1   489.8
1   2  1099.0
2   3  1004.0
3   4    37.2
4   5   371.3
5   6  2390.5
6   7  2500.0
7   8  2500.0
8   9  1298.7
9  10  1125.7
numgroups = 3  # I want this number of groups
group_area_target = df.area.sum() / numgroups
# ≈ 4272.07

My goal is to create the group column:

#df["group"] = [1,1,1,1,1,2,2,3,3,3]
#     id    area  group
# 0   1   489.8      1
# 1   2  1099.0      1
# 2   3  1004.0      1
# 3   4    37.2      1
# 4   5   371.3      1
# 5   6  2390.5      2
# 6   7  2500.0      2
# 7   8  2500.0      3
# 8   9  1298.7      3
# 9  10  1125.7      3

#sums = df.groupby("group")["area"].sum()
# group
# 1    3001.3
# 2    4890.5
# 3    4924.4

#The total deviation from the target is (sums - group_area_target).abs().sum() ≈ 2541
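
For reference, this metric can be checked directly without touching df (a minimal sketch using the df and group_area_target defined above; manual and deviation are names introduced here):

manual = pd.Series([1, 1, 1, 1, 1, 2, 2, 3, 3, 3], index=df.index)
sums = df["area"].groupby(manual).sum()
deviation = (sums - group_area_target).abs().sum()
# deviation ≈ 2541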

I found this question and answer, which roughly suggests:

a = df.area.values
shift_num = group_area_target * np.arange(1, numgroups)
# index of the first row whose cumulative area reaches each target boundary
idx = np.searchsorted(a.cumsum(), shift_num, side="right").tolist()

# mark the boundary rows with their group number
for e, i in enumerate(idx, 1):
    df.loc[i, "group"] = e

# backfill the labels upward and assign the trailing rows to the last group
df["group"] = df["group"].bfill().fillna(df["group"].max() + 1)
   id    area  group
0   1   489.8    1.0
1   2  1099.0    1.0
2   3  1004.0    1.0
3   4    37.2    1.0
4   5   371.3    1.0
5   6  2390.5    1.0
6   7  2500.0    2.0
7   8  2500.0    2.0
8   9  1298.7    3.0
9  10  1125.7    3.0

sums2 = df.groupby("group")["area"].sum()
#The total deviation is (sums2 - group_area_target).abs().sum() ≈ 3695

So this result is worse than the manual assignment of groups. How can I automate finding the best split?

BERA
  • It's difficult to get the "best" result in a vectorized way. Do you want the minimal deviation between sums? You'll have to compute many combinations and compare them. – mozway Aug 18 '23 at 17:49

1 Answer


If you really want the best combination, you need to test all possible splits; with N groups there are comb(len(s)-1, N-1) candidates. For that, itertools.combinations is useful. I'm using the minimal standard deviation of the group sums as the indicator of the best split.

from itertools import combinations, pairwise  # pairwise requires Python 3.10+

N = 3
def group(s):
    def f(c):
        # sums of the candidate groups; lower spread = better split
        pairs = pairwise((0, *c, len(s)))
        return np.std([s.iloc[slice(*p)].sum() for p in pairs])
    # try every combination of N-1 split points; range(1, len(s)) also
    # allows the last group to hold a single row
    best = min(combinations(range(1, len(s)), N-1), key=f)
    # flag the first row of each new group, then cumsum into group labels
    out = np.zeros(len(s), dtype=int)
    out[list(best)] = 1
    return np.cumsum(out)+1

df['group'] = group(df['area'])

Output:

   id    area  group
0   1   489.8      1
1   2  1099.0      1
2   3  1004.0      1
3   4    37.2      1
4   5   371.3      1
5   6  2390.5      2
6   7  2500.0      2
7   8  2500.0      3
8   9  1298.7      3
9  10  1125.7      3
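
The question scores a split by total absolute deviation from the per-group target rather than by standard deviation, so the key function can be swapped accordingly. A sketch under that assumption (group_abs is a name introduced here; for this data it selects the same split):

from itertools import combinations, pairwise
import numpy as np

def group_abs(s, n):
    # equal-share target for each group's sum
    target = s.sum() / n
    def f(c):
        # total absolute deviation of the group sums from the target
        pairs = pairwise((0, *c, len(s)))
        return sum(abs(s.iloc[slice(*p)].sum() - target) for p in pairs)
    best = min(combinations(range(1, len(s)), n - 1), key=f)
    # flag the first row of each new group, then cumsum into labels
    out = np.zeros(len(s), dtype=int)
    out[list(best)] = 1
    return np.cumsum(out) + 1

df['group'] = group_abs(df['area'], 3)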
mozway