0

I have a series of values that I want to split up into 5 bins, so that the sum of the values in the bins is approximately the same.

AD8_normal
075e5e2b-dd96-4097-8fc3-c3b82f610277    18
bd9a3e1f-b7a1-4ad9-ac8f-c6009dd5073a    18
b50fa764-1a1d-4760-bd6f-6efbba316adb    16
8cc1408b-9509-485a-8603-41895a53925d    14
3cb98c8c-918f-45fd-bd9f-11b0a6d19838    13
5fa9da0d-bfcb-48da-8bb3-e245e995e3e6    13
5950b2d8-b51a-4c84-86f6-40ee59ddf8c7    12
bbd07332-92a1-4f05-b7de-a0164b017343    12
9b67c710-f5ee-4a4b-ba80-eaf905a1a4fc    12
140f98ce-2ffe-4aca-accb-76f1cf2f08a9    12
177b0245-8ff4-4a34-b15b-1ddf909e0c9e    11
b5aaf5c4-fd3d-4781-9eaf-92e7dee4a234    11
e24f86ce-5cfe-487a-8f23-929fb7674a4a    11
b7ddaf74-adde-4bdf-9e07-f0d1a06176e7    11
5b442d58-78d8-4f3e-9d86-6782dd9945ca    10
acdcd702-dcae-45bc-b56f-1f50e4610c2a    10
c5ac2af7-e040-4d0f-b281-704349338463    10
10f815fa-16d4-4efb-a3e5-05b2f197688e    10
7098c542-7475-4030-87dd-6405952d179f    10
67566970-81f7-4f69-8765-2e1fbe242e62     9
7f2d7f97-da2e-408d-8f62-d46b27cc19fb     8
e0c43c9f-764f-44e9-86af-c80417253ba3     8
efcd4689-a4f1-4bcc-810a-57a7c3141f14     8
acbf916d-fafc-4f4d-8a2e-eeb3c35d4040     8
2fa4333d-dc1c-4f68-a745-c1cbdf90ed95     7
9b96a2c2-f3c5-479a-9666-1d4c44a17270     7
7e1070a8-8f96-440e-998f-409383a9faf1     7
90b1971d-c70f-495a-bdb0-cc3c5b67bbe1     7
Name: id, dtype: int64

I can randomly sample indices, in which case I would expect most bins to lead to similar total counts. However, there's obviously the chance that I accidentally sample the top N indices in a single bin, so that bin becomes 'overweight'. Is there a way to cut up this series into 5 parts (i.e., 5 sets of indices) so each part has a value sum that is approximately the same?

To reproduce the array:

idx = ['075e5e2b-dd96-4097-8fc3-c3b82f610277', 'bd9a3e1f-b7a1-4ad9-ac8f-c6009dd5073a', 'b50fa764-1a1d-4760-bd6f-6efbba316adb', '8cc1408b-9509-485a-8603-41895a53925d', '3cb98c8c-918f-45fd-bd9f-11b0a6d19838', '5fa9da0d-bfcb-48da-8bb3-e245e995e3e6', '5950b2d8-b51a-4c84-86f6-40ee59ddf8c7', 'bbd07332-92a1-4f05-b7de-a0164b017343', '9b67c710-f5ee-4a4b-ba80-eaf905a1a4fc', '140f98ce-2ffe-4aca-accb-76f1cf2f08a9', '177b0245-8ff4-4a34-b15b-1ddf909e0c9e', 'b5aaf5c4-fd3d-4781-9eaf-92e7dee4a234', 'e24f86ce-5cfe-487a-8f23-929fb7674a4a', 'b7ddaf74-adde-4bdf-9e07-f0d1a06176e7', '5b442d58-78d8-4f3e-9d86-6782dd9945ca', 'acdcd702-dcae-45bc-b56f-1f50e4610c2a', 'c5ac2af7-e040-4d0f-b281-704349338463', '10f815fa-16d4-4efb-a3e5-05b2f197688e', '7098c542-7475-4030-87dd-6405952d179f', '67566970-81f7-4f69-8765-2e1fbe242e62', '7f2d7f97-da2e-408d-8f62-d46b27cc19fb', 'e0c43c9f-764f-44e9-86af-c80417253ba3', 'efcd4689-a4f1-4bcc-810a-57a7c3141f14', 'acbf916d-fafc-4f4d-8a2e-eeb3c35d4040', '2fa4333d-dc1c-4f68-a745-c1cbdf90ed95', '9b96a2c2-f3c5-479a-9666-1d4c44a17270', '7e1070a8-8f96-440e-998f-409383a9faf1', '90b1971d-c70f-495a-bdb0-cc3c5b67bbe1']

vals = [18, 18, 16, 14, 13, 13, 12, 12, 12, 12, 11, 11, 11, 11, 10, 10, 10, 10, 10, 9, 8, 8, 8, 8, 7, 7, 7, 7]

AD8_normal = pandas.Series(data=vals, index=idx)
Inkidu616
  • 389
  • 3
  • 17
  • what do you mean by "BIN" ? – Aven Desta Feb 18 '21 at 15:06
  • 1
    Sum all the values, divide by number of bins to get target bin size. Then look for elements that will give you sum of a target bin size. – Alex Feb 18 '21 at 15:06
  • Well, I don't really care how many indices end up in each 'section' (which I called bin), but the sum of the values at those indices should sum up to comparable totals, per section – Inkidu616 Feb 18 '21 at 15:07
  • As an example, suppose the indices are people and the values are a measure of IQ: I want to create 5 groups that have similarly summed IQs (and it's ok to have different amounts of people in there). – Inkidu616 Feb 18 '21 at 15:09
  • Try this one pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise') – Alex Feb 18 '21 at 15:12
  • Ah, this looks promising, thanks! Will try it out and see how the counts stack up. – Inkidu616 Feb 18 '21 at 15:15
  • [This](https://stackoverflow.com/questions/3420937/algorithm-to-find-which-number-in-a-list-sum-up-to-a-certain-number) seems related, as it is also a knapsack problem. You'd have to establish whether it has a solution first for the amount of bins you require and the sum as indicated by sum/#bins. The linked post should also be able to help you sort out the algorithm you have to implement. @Alex, dividing up into quantiles does not lead to the sums being equal, right? – Steven Feb 18 '21 at 15:23
  • How big is your real data you want to bin and the number of bins? – Ben.T Feb 18 '21 at 18:52
  • This is it. 5 bins. – Inkidu616 Feb 18 '21 at 20:04

1 Answers1

0

Order the values, biggest first. Put the first five values each in a new bin. Now you have five bins.

From now on, put each new value in the bin with the smallest total.

Dirk Roorda
  • 101
  • 4