5

For a psychological experiment I have to divide a set of 240 pictures, each described by 4 features (real numbers), into 3 subsets with an equal number of elements in each subset (240/3 = 80), in such a way that all subsets are approximately balanced with respect to these features (in terms of mean and standard deviation).

Can anybody suggest an algorithm to automate that? Are there any packages/modules in Python or R that I could use to do that? Where should I start?

twowo
  • 621
  • 1
  • 8
  • 15
  • This problem statement resembles the partition problem http://en.wikipedia.org/wiki/Partition_problem which is NP-complete. – Narendra Yadala Sep 24 '11 at 13:26
  • That is what I thought, but what I need is NOT an optimal division, only a rough approximation. Can anyone suggest how I should design an iteration to examine all possible divisions in the case described above? How many combinations would that be? – twowo Sep 24 '11 at 13:28
  • Please have a look at this SO question http://stackoverflow.com/questions/4803668/3-partition-problem – Narendra Yadala Sep 24 '11 at 13:34
  • Since the set is only 240 pictures, just use the `random.sample` solution over and over again until you get a result that meets the criteria. You'll just need to write code to apply the "approximately balanced" test. – agf Sep 24 '11 at 13:45
  • This seems very similar to the question asked and answered here: http://stackoverflow.com/questions/4462531/grouping-objects-to-achieve-a-similar-mean-property-for-all-groups – cytochrome Sep 25 '11 at 13:05

5 Answers

5

If I understand your problem correctly, you might use random.sample() in Python:

import random

pool = set(["foo", "bar", "baz", "123", "456", "789"]) # your 240 elements here
slen = len(pool) // 3 # we need 3 equal subsets (integer division)
set1 = set(random.sample(list(pool), slen)) # 1st random subset
pool -= set1
set2 = set(random.sample(list(pool), slen)) # 2nd random subset
pool -= set2
set3 = pool # 3rd random subset
Michał Šrajer
  • 30,364
  • 7
  • 62
  • 85
  • 3
    The whole point is they can't be random. Each group needs to have a roughly equal proportion of certain qualities. `random.sample` will only give you (approximately) that for large groups -- much more than the 240 the OP mentions. Throw it in a loop, however, and then check if the result meets the criteria, and it could work (since the group is only 240, you can do this many, many times without a performance problem). – agf Sep 24 '11 at 13:43
  • Thanks a lot! I think it will solve the problem. I am just curious: how many combinations would that be if I wanted to make it a brute-force exhaustive search? – twowo Sep 24 '11 at 14:25
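Following agf's comments above, here is a minimal sketch of the resample-until-balanced approach; the feature matrix and the tolerance tol are placeholders, not part of the original answer:

import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(240, 4))      # placeholder for the real 240 x 4 feature matrix

def is_balanced(features, groups, tol=0.05):
    # accept the split if every group's mean and std are within tol of the overall values
    overall_mean = features.mean(axis=0)
    overall_std = features.std(axis=0)
    for g in groups:
        sub = features[g]
        if np.any(np.abs(sub.mean(axis=0) - overall_mean) > tol):
            return False
        if np.any(np.abs(sub.std(axis=0) - overall_std) > tol):
            return False
    return True

groups = None
for attempt in range(100_000):            # keep resampling until the balance test passes
    order = rng.permutation(len(features))
    candidate = np.split(order, 3)        # three index arrays of 80 items each
    if is_balanced(features, candidate):
        groups = candidate
        break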
2

I would tackle this as follows:

  1. Divide into 3 equal subsets.
  2. Figure out the mean and variance of each subset. From them construct an "unevenness" measure.
  3. Compare pairs of elements in different subsets; if swapping them would reduce the "unevenness", swap them. Continue until either no swap improves things or the total unevenness is below some arbitrary "good enough" threshold. (A sketch of this swap loop follows below.)
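A minimal sketch of this swap-based local search; the "unevenness" measure used here (squared deviation of each subset's mean and std from the overall values) is just one possible choice, not something btilly specifies:

import numpy as np

def unevenness(features, labels):
    # squared deviation of each subset's mean and std from the overall mean and std
    total = 0.0
    overall_mean = features.mean(axis=0)
    overall_std = features.std(axis=0)
    for k in range(3):
        sub = features[labels == k]
        total += np.sum((sub.mean(axis=0) - overall_mean) ** 2)
        total += np.sum((sub.std(axis=0) - overall_std) ** 2)
    return total

def balance_by_swapping(features, max_passes=20):
    n = len(features)                            # assumes n is divisible by 3
    labels = np.repeat([0, 1, 2], n // 3)        # step 1: an arbitrary equal split
    np.random.shuffle(labels)
    best = unevenness(features, labels)
    for _ in range(max_passes):                  # step 3: sweep until no swap helps
        improved = False
        for i in range(n):
            for j in range(i + 1, n):
                if labels[i] == labels[j]:
                    continue
                labels[i], labels[j] = labels[j], labels[i]
                score = unevenness(features, labels)
                if score < best:
                    best, improved = score, True
                else:
                    labels[i], labels[j] = labels[j], labels[i]   # undo the swap
        if not improved:
            break
    return labels

Each full sweep is O(n^2) evaluations of the measure, which remains cheap for 240 items.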
btilly
  • 43,296
  • 3
  • 59
  • 88
1

You can easily do this using the plyr library in R. Here is the code.

require(plyr)

# CREATE DUMMY DATA
mydf = data.frame(feature = sample(LETTERS[1:4], 240, replace = TRUE))

# SPLIT BY FEATURE AND DIVIDE INTO THREE SUBSETS EQUALLY
ddply(mydf, .(feature), summarize, sub = sample(1:3, 60, replace = TRUE))
Ramnath
  • 54,439
  • 16
  • 125
  • 152
1

In case you are still interested in the exhaustive search question: there are 240 choose 80 possibilities for the first set and then another 160 choose 80 for the second set, at which point the third set is fixed. In total, this gives you:

120554865392512357302183080835497490140793598233424724482217950647 * 92045125813734238026462263037378063990076729140

Clearly, this is not an option :)
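For the record, the count is easy to reproduce with Python's math.comb (Python 3.8+):

from math import comb

# ways to pick the first subset of 80, then the second from the remaining 160
total = comb(240, 80) * comb(160, 80)
print(total)    # roughly 1.1e112 labeled partitions -- hopeless for brute force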

Dino
  • 1,576
  • 12
  • 12
0

Order your items by decreasing Mahalanobis distance from the mean; they will run from most extraordinary to most boring, taking into account the effects of whatever correlations exist amongst the measures.

Assign X[3*i], X[3*i+1], X[3*i+2] to the subsets A, B, C, choosing for each i the permutation of A/B/C that minimizes your mismatch measure.

Why decreasing order? The statistically heavy items will be assigned first, and the choice of permutation in the larger number of subsequent rounds will have a better chance of evening out initial imbalances.

The point of this procedure is to maximize the chance that whatever outliers exist in the data set will be assigned to separate subsets.
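A minimal sketch of this procedure in Python; the mismatch measure (spread of the running subset means) is an assumption, since phunctor leaves that choice to you:

import numpy as np
from itertools import permutations

def mahalanobis_round_robin(features):
    # assumes the number of items is divisible by 3
    n, d = features.shape
    mean = features.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(features, rowvar=False))
    diffs = features - mean
    dist = np.einsum('ij,jk,ik->i', diffs, cov_inv, diffs)   # squared Mahalanobis distances
    order = np.argsort(-dist)                                # most extraordinary first

    subsets = [[], [], []]
    for i in range(0, n, 3):
        trio = order[i:i + 3]
        best_perm, best_score = None, np.inf
        for perm in permutations(range(3)):
            # tentatively add the trio in this order and measure how far apart the subset means drift
            candidate = [subsets[k] + [trio[j]] for j, k in enumerate(perm)]
            means = np.array([features[idx].mean(axis=0) for idx in candidate])
            score = np.sum((means - means.mean(axis=0)) ** 2)
            if score < best_score:
                best_perm, best_score = perm, score
        for j, k in enumerate(best_perm):
            subsets[k].append(trio[j])
    return subsets

With only 80 rounds of 6 permutations each, this is essentially instantaneous for 240 items.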

phunctor
  • 599
  • 3
  • 9