
I need to distribute a set of data evenly over time, based on historical data, such that each digit appears an equal (or close to equal) number of times in each position. The problem: given a list of orderings used in the past that looks like this (but could have any number of elements):

1,2,5,3,4
4,1,5,2,3
1,3,5,2,4
4,1,2,3,5
2,4,1,3,5
5,1,4,3,2
1,5,3,2,4
5,1,3,2,4
3,2,5,4,1
4,3,1,5,2

how can I find the least-used ordering of the values, one that will lead to a "more balanced" set of orderings? The obvious answer is to group by, count, and pick the least-used one, but the least-used permutation may never have been used at all. For example, here the ordering "1,2,3,4,5" is a candidate for least used because it doesn't appear in the list.

The simple answer seems to be to identify the position in which "1" appears least frequently, assign "1" to that position, and so on for each digit. I suspect that works, but I feel like there's a more elegant solution I haven't considered, potentially with cross joins so that all possible combinations are included.

any ideas?

powlette
  • why is "1,2,3,4,5" not good - it feels to me like you want "randomness" - but this is a rather subjective thing. Your idea won't work because what if the *least position* of 1 and 3 are the same? Anyhow just generate random permutations - sooner or later you have them all sampled evenly ;) – Random Dev Sep 22 '11 at 15:10
  • The "any number of elements" makes your problem essentially impossible. If there are n elements then there are n! orderings. For n=5 that's only 120 possibilities, but for n=15 there are over a trillion possibilities. Unless your data set has multiple trillion items in it **almost all of the trillion possible orderings are going to have zero examples in the data set**. – Eric Lippert Sep 22 '11 at 15:11
  • Do the generated numbers need to be random? I find that sequential is a pretty even distribution over time... – David Chan Sep 22 '11 at 15:12
  • I shouldn't have said "any number of elements" - there won't be more than 10 total at any time. I agree that over time just choosing a random order is good. The problem is, it needs to be evenly distributed over the long run and the short run. Every selection needs to make it more balanced if possible, as we can't wait for it to even out in the long run. – powlette Sep 22 '11 at 15:25
  • Maybe keep a running tally of the "average" ordering so far, and each time you want to add a new ordering to the list, you could find the ordering that is furthest away from the average. To do that, you could use Euclidean distance and the heuristic that you want to greedily choose matching elements whose difference is maximal... so if the average is (1.2, 2.3, 2.5) for numbers 1,2,3, the best choice would be (3, 2, 1) for square distance 5.58... think this would be amortized O(k^3 n) where your k = 5 and n is how many new orderings you add. – Patrick87 Sep 22 '11 at 15:30

3 Answers


What you have here is a histogram leveling problem.

Consider the problem from this perspective: you have a set of N histograms, one per position, each representing the frequency of occurrence of the N values over the discrete range {1..N}. What you want to do is add a new set of values to your population of data that shifts all the histograms closer to being level. Given the nature of your problem, we know that each value will, overall, appear the same number of times as every other value.

One way to do so is to find which value has the lowest frequency of occurrence in any position, and assign it to that position. Next, among the remaining values and positions, find the value with the lowest frequency of occurrence in any open position, and assign it there. Repeat this process until every value has been assigned a unique position. This gives you your next set of values, and you can repeat the whole operation iteratively to keep generating new value sets, each one attempting to re-balance the distribution.
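
Here is a minimal sketch of that greedy assignment in Python, using the data from the question; the function name next_ordering and the representation of the history as lists of 1-based values are my own assumptions, not part of the answer. It recomputes the counts from scratch for clarity; maintaining them incrementally, as noted below, is more efficient.

    from itertools import product

    def next_ordering(history, n=5):
        # counts[v][p] = how often value v+1 has appeared at position p
        counts = [[0] * n for _ in range(n)]
        for ordering in history:
            for pos, val in enumerate(ordering):
                counts[val - 1][pos] += 1

        result = [0] * n
        free_vals = set(range(n))   # values not yet placed
        free_pos = set(range(n))    # positions not yet filled
        while free_vals:
            # Pick the (value, position) pair with the lowest count so far.
            v, p = min(product(free_vals, free_pos),
                       key=lambda vp: counts[vp[0]][vp[1]])
            result[p] = v + 1
            free_vals.remove(v)
            free_pos.remove(p)
        return result

    history = [
        [1, 2, 5, 3, 4], [4, 1, 5, 2, 3], [1, 3, 5, 2, 4], [4, 1, 2, 3, 5],
        [2, 4, 1, 3, 5], [5, 1, 4, 3, 2], [1, 5, 3, 2, 4], [5, 1, 3, 2, 4],
        [3, 2, 5, 4, 1], [4, 3, 1, 5, 2],
    ]
    print(next_ordering(history))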

If you maintain the histograms as you distribute values, this becomes a relatively efficient operation (you don't have to constantly re-scan the data set).

Keep in mind, however, that for any sufficiently small population of values, you will always be "out of balance" to some degree. There's no way around this.

LBushkin

I presume that you have a way to generate a random permutation (e.g. Most efficient way to randomly "sort" (Shuffle) a list of integers in C#). Given that, one suggestion to generate a single new ordering is as follows:

1) Generate two random permutations

2) Keep whichever one of them would even out the imbalance the most.

One measure of balance is to treat the full list of digit-frequency counts at each position as a vector; in the case of perfect balance, every element would be the same. The imbalance is then the length of the vector you get by subtracting off that perfect vector. By choosing between two random permutations you pick a permutation from a distribution whose mean vector points in a direction opposite to the current imbalance, so you should tend to correct it while still producing a random-ish sequence of permutations.
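
A minimal sketch of this pick-the-better-of-two idea in Python; the names imbalance and next_ordering, and the counts-as-2D-list representation, are assumptions made for illustration.

    import random

    # counts[v][p] = how often value v+1 has appeared at position p;
    # total = number of orderings recorded so far.
    def imbalance(counts, total, n):
        # Squared distance from the perfectly level histogram, where every
        # (value, position) cell would hold total / n occurrences.
        ideal = total / n
        return sum((c - ideal) ** 2 for row in counts for c in row)

    def next_ordering(counts, total, n=5):
        best, best_score = None, None
        for _ in range(2):
            perm = list(range(1, n + 1))
            random.shuffle(perm)
            # Score the counts as they would look after adding this permutation.
            trial = [row[:] for row in counts]
            for pos, val in enumerate(perm):
                trial[val - 1][pos] += 1
            score = imbalance(trial, total + 1, n)
            if best is None or score < best_score:
                best, best_score = perm, score
        return best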

mcdowella

If the total number of combinations is small enough there's an approach I used on a similar problem long ago:

Maintain a pool of choices that is periodically replenished.

In your example you have 120 possible permutations. Make an array of 120 elements and assign each an initial value of, say, 5. When you need a random ordering you pick from this pool, the number in each bin being the weight given to that bin. (At the start the bins sum to 600: pick a random number from 1 to 600 and subtract bins from it until the result is <= 0; the last bin you subtracted is your result.) When an entry is picked, decrement that bin by one. Once you've made 120 draws from the pool, add 1 to every bin.
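
A minimal sketch of that weighted pool in Python; the class name PermutationPool and its interface are my own framing of the scheme above.

    import random
    from itertools import permutations

    class PermutationPool:
        def __init__(self, n=5, initial_weight=5):
            # One bin per permutation, each starting with the same weight.
            self.perms = list(permutations(range(1, n + 1)))
            self.weights = [initial_weight] * len(self.perms)
            self.draws = 0

        def draw(self):
            # Weighted pick: random number in [1, sum of weights], then walk
            # the bins subtracting weights until we reach zero or below.
            r = random.randint(1, sum(self.weights))
            for i, w in enumerate(self.weights):
                r -= w
                if r <= 0:
                    break
            self.weights[i] -= 1   # a picked entry becomes less likely
            self.draws += 1
            if self.draws % len(self.perms) == 0:
                # After a full round of draws, replenish every bin.
                self.weights = [w + 1 for w in self.weights]
            return self.perms[i]

    pool = PermutationPool()
    print(pool.draw())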

Obviously this becomes impractical if the total number of possibilities is too high.

Loren Pechtel