
What's the easiest way to sort evenly distributed values into a predefined number of groups?

import pandas as pd

data = {'impact': [10, 30, 20, 10, 90, 60, 50, 40]}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)

   impact
a      10
b      30
c      20
d      10
e      90
f      60
g      50
h      40

numgroups = 4

group_targetsum = round(df.impact.sum() / numgroups, -1)

print(group_targetsum)

80.0

In the case above, I'd like to create 4 groups from df. The only grouping criterion is that the sum of impact in each group should be approximately equal to group_targetsum; the impact sum can be above or below group_targetsum within a reasonable margin.

Ultimately, I'd like to separate these groups into their own dataframes, preserving index. Resulting in something like this:

print(df_a)

   impact
e      90

print(df_b)

   impact
c      20
f      60

print(df_c)

   impact
a      10
d      10
g      50

print(df_d)

   impact
b      30
h      40

Resulting dataframes don't need to be exactly this, just as long as they sum as close as possible to group_targetsum.

j.k

2 Answers


Assuming fairly similar values in the series, here's an approach using searchsorted -

In [150]: df
Out[150]: 
   impact
a      10
b      30
c      20
d      10
e      90
f      60
g      50
h      40

In [151]: a = df.values.ravel()

In [152]: shift_num = group_targetsum*np.arange(1,numgroups)

In [153]: idx = np.searchsorted(a.cumsum(), shift_num,'right')

In [154]: np.split(a, idx)
Out[154]: [array([10, 30, 20, 10]), array([90]), array([60]), array([50, 40])]
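Since the question also asks for separate DataFrames with the original index preserved, the same split points can be applied with `df.iloc` instead of splitting the raw array. A minimal sketch, rebuilding the setup above:

```python
import numpy as np
import pandas as pd

data = {'impact': [10, 30, 20, 10, 90, 60, 50, 40]}
df = pd.DataFrame(data, index=list('abcdefgh'))

numgroups = 4
group_targetsum = round(df.impact.sum() / numgroups, -1)  # 80.0

a = df['impact'].values
shift_num = group_targetsum * np.arange(1, numgroups)       # [80, 160, 240]
idx = np.searchsorted(a.cumsum(), shift_num, 'right')       # [4, 5, 6]

# slice the DataFrame at the same positions, keeping the index
bounds = [0, *idx, len(df)]
groups = [df.iloc[i:j] for i, j in zip(bounds, bounds[1:])]
```

Each element of `groups` is a DataFrame carrying its original index labels, so `groups[1]` is the single-row frame for `e`.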
Divakar
  • I'm glad I'm thinking along the same lines. I messed it up because I tried to sharpen the line. Instead of `[80, 160, 240, 320]` I used `[310/4, 310/2, 310 * 3/4, 310]`. In this case, `searchsorted` identifies the same index twice. Basically try it with `group_targetsum = a.sum() / 4`. The real answer to this needs to ultimately include some fuzzy decisions around corner cases. – piRSquared Jan 13 '17 at 17:53
  • @piRSquared Agreed, this might mess up with big numbers at potential interval points. – Divakar Jan 13 '17 at 17:54
  • Ha, I played around with this awhile before exploring your answer and now see that I have conceptually done things the same as you did for the most part (just doing it in pandas instead of numpy). However, I think you might like how I did the adjustment for midpoint which I believe will usually group things more evenly. Without the midpoint adjustment, I would get the same answer as what you would get if you left off the 'right' option from `searchsorted` – JohnE Jan 14 '17 at 18:40
  • @JohnE Yeah the moment I saw yours, it looked similar :) Well, I need to look into your midpoint thing in more detail, but I will take your word on it for now. – Divakar Jan 14 '17 at 18:56
  • Thanks. I see how that groups the values, but not sure how I can use this while preserving the original index. – j.k Jan 17 '17 at 14:51

Conceptually we'd just like to use a weighted version of qcut, but that doesn't exist in pandas at this time. Nevertheless, we can accomplish the same thing by combining cumsum and cut. The cumsum essentially gives us the weighting, and we then slice it up with cut.

(Note about 'csum_midpoint': without the midpoint adjustment, each value would be binned according to where it begins (in a cumulative sense), which biases the binning towards the higher groups. The midpoint adjustment can't make things perfectly even, but it helps. I believe this answer is mathematically the same as @Divakar's, except that I use the midpoint here where he uses 'right'.)

df['csum'] = df['impact'].cumsum()
df['csum_midpoint'] = (df.csum + df.csum.shift().fillna(0)) / 2.

df['grp'] = pd.cut(df.csum_midpoint, np.linspace(0, df['impact'].sum(), numgroups + 1))
df.groupby(df.grp)['impact'].sum()

grp
(0, 77.5]       70
(77.5, 155]     90
(155, 232.5]    60
(232.5, 310]    90
Name: impact, dtype: int64

df

   impact  csum  csum_midpoint           grp
a      10    10            5.0     (0, 77.5]
b      30    40           25.0     (0, 77.5]
c      20    60           50.0     (0, 77.5]
d      10    70           65.0     (0, 77.5]
e      90   160          115.0   (77.5, 155]
f      60   220          190.0  (155, 232.5]
g      50   270          245.0  (232.5, 310]
h      40   310          290.0  (232.5, 310]
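To carry this through to the separate per-group DataFrames the question asks for, the groupby can be unpacked into a dict of frames. A sketch rebuilding the frame from the question; `observed=True` (which skips any empty bins) is an assumption about the pandas version in use:

```python
import numpy as np
import pandas as pd

data = {'impact': [10, 30, 20, 10, 90, 60, 50, 40]}
df = pd.DataFrame(data, index=list('abcdefgh'))
numgroups = 4

# cumulative weight, shifted to the interval midpoint to reduce binning bias
df['csum'] = df['impact'].cumsum()
df['csum_midpoint'] = (df['csum'] + df['csum'].shift().fillna(0)) / 2.0
df['grp'] = pd.cut(df['csum_midpoint'],
                   np.linspace(0, df['impact'].sum(), numgroups + 1))

# one DataFrame per bin, original index preserved
frames = {name: g[['impact']] for name, g in df.groupby('grp', observed=True)}
```

Iterating the groupby yields the bins in interval order, so the dict values line up with the `(0, 77.5]`, `(77.5, 155]`, ... groups shown above.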
JohnE