
VERY IMPORTANT EDIT: All Ai are unique.

The Question

I have a list A of n unique objects. Each object Ai has a variable percentage Pi.

I want to create an algorithm that generates a new list B of k objects (k < n/2, and in most cases k is significantly less than n/2; e.g. n=231, k=21). List B should have no duplicates and will be populated with objects originating from list A with the following restriction:

The probability that an object Ai appears in B is Pi.

What I Have Tried

(These snippets are in PHP simply for the purposes of testing.) I first made list A:

$list = [
    "A" => 2.5, 
    "B" => 2.5, 
    "C" => 2.5, 
    "D" => 2.5, 
    "E" => 2.5, 
    "F" => 2.5, 
    "G" => 2.5, 
    "H" => 2.5, 
    "I" => 5,   
    "J" => 5,   
    "K" => 2.5, 
    "L" => 2.5, 
    "M" => 2.5, 
    "N" => 2.5, 
    "O" => 2.5, 
    "P" => 2.5, 
    "Q" => 2.5, 
    "R" => 2.5, 
    "S" => 2.5, 
    "T" => 2.5, 
    "U" => 5,   
    "V" => 5,   
    "W" => 5,   
    "X" => 5,   
    "Y" => 5,   
    "Z" => 20   
];

At first I tried the following two algorithms:

$result = [];

while (count($result) < 10) {
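    // Draw a uniform value in [0, 100] (five decimal places of resolution) and
    // walk the cumulative percentages; a duplicate pick is simply discarded
    // and redrawn on the next pass of the while loop.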
    $rnd = rand(0,10000000) / 100000;

    $sum = 0;
    foreach ($list as $key => $value) {
        $sum += $value;
        if ($rnd <= $sum) {
            if (in_array($key,$result)) {
                break;
            } else {
                $result[] = $key;
                break;
            }
        }
    }
}

AND

$result = [];

while (count($result) < 10) {
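    // Total the weights still left in $list, draw within that range, and
    // remove the picked key from $list so it cannot be selected again.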
    $sum = 0;
    foreach ($list as $key => $value) {
        $sum += $value;
    }

    $rnd = rand(0,$sum * 100000) / 100000;

    $sum = 0;
    foreach ($list as $key => $value) {
        $sum += $value;
        if ($rnd <= $sum) {
            $result[] = $key;
            unset($list[$key]);
            break;
        }
    }
}

The only difference between the two algorithms is that one tries again when it encounters a duplicate, while the other removes the object from list A once it is picked. As it turns out, these two algorithms have the same probability outputs.

I ran the second algorithm 100,000 times and kept track of how many times each letter was picked. The following array contains the percentage chance that a letter appears in any list B, based on the 100,000 tests.

[A] => 30.213
[B] => 29.865
[C] => 30.357
[D] => 30.198
[E] => 30.152
[F] => 30.472
[G] => 30.343
[H] => 30.011
[I] => 51.367
[J] => 51.683
[K] => 30.271
[L] => 30.197
[M] => 30.341
[N] => 30.15
[O] => 30.225
[P] => 30.135
[Q] => 30.406
[R] => 30.083
[S] => 30.251
[T] => 30.369
[U] => 51.671
[V] => 52.098
[W] => 51.772
[X] => 51.739
[Y] => 51.891
[Z] => 93.74

Looking back at the algorithm, this makes sense. The algorithm incorrectly interpreted the original percentages as the chance that an object is picked for any given position, not the chance that it appears anywhere in a list B. So, for example, in reality the chance that Z appears in a list B is about 93%, but the chance that Z is picked for a particular index Bn is 20%. This is NOT what I want. I want the chance that Z appears in a list B to be 20%.

Is this even possible? How can it be done?

EDIT 1

I tried simply scaling the Pi so that they sum to k. This worked when all the Pi were equal, but after modifying their values the results drifted further and further from the targets.

Initial Probabilities

$list= [
    "A" => 8.4615,
    "B" => 68.4615,
    "C" => 13.4615,
    "D" => 63.4615,
    "E" => 18.4615,
    "F" => 58.4615,
    "G" => 23.4615,
    "H" => 53.4615,
    "I" => 28.4615,
    "J" => 48.4615,
    "K" => 33.4615,
    "L" => 43.4615,
    "M" => 38.4615,
    "N" => 38.4615,
    "O" => 38.4615,
    "P" => 38.4615,
    "Q" => 38.4615,
    "R" => 38.4615,
    "S" => 38.4615,
    "T" => 38.4615,
    "U" => 38.4615,
    "V" => 38.4615,
    "W" => 38.4615,
    "X" => 38.4615,
    "Y" =>38.4615,
    "Z" => 38.4615
];

Results after 10,000 runs

Array
(
    [A] => 10.324
    [B] => 59.298
    [C] => 15.902
    [D] => 56.299
    [E] => 21.16
    [F] => 53.621
    [G] => 25.907
    [H] => 50.163
    [I] => 30.932
    [J] => 47.114
    [K] => 35.344
    [L] => 43.175
    [M] => 39.141
    [N] => 39.127
    [O] => 39.346
    [P] => 39.364
    [Q] => 39.501
    [R] => 39.05
    [S] => 39.555
    [T] => 39.239
    [U] => 39.283
    [V] => 39.408
    [W] => 39.317
    [X] => 39.339
    [Y] => 39.569
    [Z] => 39.522
)
Hurricane Development
  • `The probability that an object An appears in B is Pn.` That's tricky, and I believe it is not what you want. Specifically, if `k=n/2`, at least half of the elements should have `B_i>=1/2`. – amit Jun 24 '15 at 07:14
  • @amit I am pretty sure this is what I want, but I am open to the possibility that I did not describe my goal properly. `K != n/2` but rather `K < n/2`, and usually a lot less than `n/2`; take a look at the example numbers above. I also do not understand what `B_i` means. – Hurricane Development Jun 24 '15 at 07:17
  • 1
    IN the example the probability to generate "A" is 2.5? It's not probability in this case, probability is bounded to be in range `[0,1]` – amit Jun 24 '15 at 07:21
  • @amit Those are percentages, so 2.5% -> .025 – Hurricane Development Jun 24 '15 at 07:30
  • OK, now I am following you, and yes, you are correct in your terminology. Thanks for clarifying; I will try to come up with an answer. – amit Jun 24 '15 at 07:34
  • @amit Fantastic thank you so much, Good luck! – Hurricane Development Jun 24 '15 at 07:35
  • Maybe I just don't understand that kind of math... but shouldn't your results sum up to 100, too? – Burki Jun 24 '15 at 07:47
  • So, I believe, here is [your answer](http://stackoverflow.com/a/21572842/2637490), as long as you keep probabilities as normalized values, where 0.75 means "75%". You may also want to add duplicate checks, but that won't change the algorithm itself. – Alma Do Jun 24 '15 at 07:51
  • @Burki I am not sure how to explain it, but no. Those are the "ratios" of the sum of all the results. If k=1 then I would say yes, but those probabilities are for the whole list B, not just each item in the list. – Hurricane Development Jun 24 '15 at 07:54
  • So that 93.74% probability says this is the chance that there is a Z in the whole list? If yes, then I think I may have understood your question. – Burki Jun 24 '15 at 07:56
  • Also, you might get better answers from the statistics experts at [Cross Validated](http://stats.stackexchange.com/). Not sure it's 100% on-topic there, but these guys know their probability. – amit Jun 24 '15 at 08:08
  • @Burki Yea now you got it – Hurricane Development Jun 24 '15 at 15:54

3 Answers


We must have sum_i P_i = k, or else we cannot succeed: B always contains exactly k elements, so its expected size, which is sum_i P_i, must equal k.

As stated, the problem is somewhat easy, but you may not like this answer, on the grounds that it's "not random enough".

Sample a uniform random permutation Perm on the integers [0, n)
Sample X uniformly at random from [0, 1)
For i in Perm
    If X < P_i, then append A_i to B and update X := X + (1 - P_i)
    Else, update X := X - P_i
End

You'll want to approximate the calculations involving real numbers with fixed-point arithmetic, not floating-point.
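For concreteness, here is a minimal Python sketch of the pseudocode above. The function name `systematic_sample` is my own illustrative choice, and exact rational arithmetic (`fractions.Fraction`) stands in for the fixed-point arithmetic the answer recommends.

import random
from fractions import Fraction


def systematic_sample(items, probs):
    # Illustrative implementation of the systematic-sampling pseudocode above.
    # probs[i] is P_i as a Fraction; the P_i must sum to the integer k.
    k = sum(probs)
    assert k == int(k), "sum_i P_i must equal k"
    perm = list(range(len(items)))
    random.shuffle(perm)                          # uniform random permutation
    x = Fraction(random.randrange(10**9), 10**9)  # X ~ Uniform[0, 1), discretized
    chosen = []
    for i in perm:
        if x < probs[i]:
            chosen.append(items[i])               # include A_i in B
            x += 1 - probs[i]
        else:
            x -= probs[i]
    return chosen


# Example: inclusion probabilities 2/3, 1/2, 1/2, 1/3 sum to k = 2, so every
# call returns exactly two distinct items, each appearing with its own P_i.
print(systematic_sample(["A", "B", "C", "D"],
                        [Fraction(2, 3), Fraction(1, 2),
                         Fraction(1, 2), Fraction(1, 3)]))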

The missing condition is that the distribution have a technical property called "maximum entropy". Like amit, I cannot think of a good way to do this. Here's a clumsy way.

My first (and wrong) instinct for solving this problem was to include each A_i in B independently with probability P_i and retry until B is the right length (there won't be too many retries, for reasons that you can ask math.SE about). The problem is that the conditioning messes up the probabilities. If P_1 = 1/3 and P_2 = 2/3 and k = 1, then the outcomes are

{}: probability 2/9
{A_1}: probability 1/9
{A_2}: probability 4/9
{A_1, A_2}: probability 2/9,

and, after conditioning on the outcome having exactly k = 1 element, the probabilities are actually (1/9)/(1/9 + 4/9) = 1/5 for A_1 and 4/5 for A_2 instead of 1/3 and 2/3.

Instead, we should substitute new probabilities Q_i that yield the proper conditional distribution. I don't know of a closed form for the Q_i, so I propose to find them using a numerical optimization algorithm like gradient descent. Initialize Q_i = P_i (why not?). Using dynamic programming, it's possible to find, for the current setting of the Q_i, the probability that, given an outcome with l elements, A_i is one of those elements. (We only care about the l = k entry, but we need the others to make the recurrences work.) With a little more work, we can get the whole gradient. Sorry this is so sketchy.

In Python 3, using a nonlinear solution method that seems to always converge (update each q_i simultaneously to its marginally correct value, then normalize):

#!/usr/bin/env python3
import collections
import operator
import random


def constrained_sample(qs):
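    # Rejection sampling: include each index i independently with probability
    # qs[i] and retry until exactly k = round(sum(qs)) indices are included.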
    k = round(sum(qs))
    while True:
        sample = [i for i, q in enumerate(qs) if random.random() < q]
        if len(sample) == k:
            return sample


def size_distribution(qs):
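    # Dynamic program for the Poisson-binomial distribution of the number of
    # included elements: size_dist[j] = P(exactly j elements are included).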
    size_dist = [1]
    for q in qs:
        size_dist.append(0)
        for j in range(len(size_dist) - 1, 0, -1):
            size_dist[j] += size_dist[j - 1] * q
            size_dist[j - 1] *= 1 - q
    assert abs(sum(size_dist) - 1) <= 1e-10
    return size_dist


def size_distribution_without(size_dist, q):
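    # Undo one element's step of the dynamic program, recovering the size
    # distribution of the remaining elements; branching on q keeps the
    # back-substitution numerically stable.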
    size_dist = size_dist[:]
    if q >= 0.5:
        for j in range(len(size_dist) - 1, 0, -1):
            size_dist[j] /= q
            size_dist[j - 1] -= size_dist[j] * (1 - q)
        del size_dist[0]
    else:
        for j in range(1, len(size_dist)):
            size_dist[j - 1] /= 1 - q
            size_dist[j] -= size_dist[j - 1] * q
        del size_dist[-1]
    assert abs(sum(size_dist) - 1) <= 1e-10
    return size_dist


def test_size_distribution(qs):
    d = size_distribution(qs)
    for i, q in enumerate(qs):
        d1a = size_distribution_without(d, q)
        d1b = size_distribution(qs[:i] + qs[i + 1 :])
        assert len(d1a) == len(d1b)
        assert max(map(abs, map(operator.sub, d1a, d1b))) <= 1e-10


def normalized(qs, k):
    sum_qs = sum(qs)
    qs = [q * k / sum_qs for q in qs]
    assert abs(sum(qs) / k - 1) <= 1e-10
    return qs


def approximate_qs(ps, reps=100):
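    # Fixed-point iteration: repeatedly set each q_i to the value whose
    # conditional (given an outcome of size k) inclusion probability equals
    # p_i, then rescale so the q_i sum to k.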
    k = round(sum(ps))
    qs = ps[:]
    for j in range(reps):
        size_dist = size_distribution(qs)
        for i, p in enumerate(ps):
            d = size_distribution_without(size_dist, qs[i])
            d.append(0)
            qs[i] = p * d[k] / ((1 - p) * d[k - 1] + p * d[k])
        qs = normalized(qs, k)
    return qs


def test(ps, reps=100000):
    print(ps)
    qs = approximate_qs(ps)
    print(qs)
    counter = collections.Counter()
    for j in range(reps):
        counter.update(constrained_sample(qs))
    test_size_distribution(qs)
    print("p", "Actual", sep="\t")
    for i, p in enumerate(ps):
        print(p, counter[i] / reps, sep="\t")


if __name__ == "__main__":
    test([2 / 3, 1 / 2, 1 / 2, 1 / 3])
David Eisenstat

Let's analyze it for a second. With replacement (not what you want, but simpler to analyze):

Given a list L of size k and an element a_i, the probability for a_i to be in the list is denoted by your value p_i.

Let's examine the probability of a_i being at some index j in the list, and denote that probability q_i,j. Note that for any index t in the list, q_i,j = q_i,t, so we can simply write q_i,1 = q_i,2 = ... = q_i,k = q_i.

The probability that a_i will be anywhere in the list is denoted as:

1-(1-q_i)^k

But it is also p_i, so we need to solve the equation

1 - (1-q_i)^k = p_i
1 - (1-q_i)^k - p_i = 0

One way to do it is the Newton-Raphson method.

After calculating the probability for each element, check that it is indeed a probability space (the q_i sum to 1, and all of them are in [0,1]). If it is not, it cannot be done for the given probabilities and k.
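A minimal sketch of that per-slot calculation, in Python for consistency with the other answer. The function names `solve_q` and `per_slot_probabilities` are my own; the equation also admits the closed form q_i = 1 - (1 - p_i)^(1/k), so Newton-Raphson is shown only because the answer mentions it.

def solve_q(p, k, iters=50):
    # Illustrative only. Newton-Raphson on f(q) = 1 - (1 - q)^k - p.
    q = p / k  # starting guess below the root, so the iteration climbs toward it
    for _ in range(iters):
        f = 1 - (1 - q) ** k - p
        df = k * (1 - q) ** (k - 1)
        q -= f / df
    return q


def per_slot_probabilities(ps, k):
    qs = [solve_q(p, k) for p in ps]
    # Feasibility check from the answer: the per-slot q_i must themselves
    # form a probability distribution over the n elements.
    feasible = abs(sum(qs) - 1) < 1e-9 and all(0 <= q <= 1 for q in qs)
    return qs, feasible

As the comment thread below notes, requiring both that the q_i sum to 1 and that the p_i sum to k tends to hold only in special cases, so `feasible` will often come back False.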


Without replacement: This is trickier, since now q_i,j != q_i,t (the selections are not i.i.d.). The probability calculations here will be much harder, and I am not sure at the moment how to do them; I suppose they would need to be done at run-time, during the creation of the list.

(Deleted a solution that I am almost certain is biased).

amit
  • Yeah, so your logic for **With Replacements** is perfect; I tested this and it worked, but as you noted it does not work for **Without Replacements**. I have a feeling that after every single item is picked, all `q_i` must be recalculated. I know that one condition for the problem is that `sum_i p_i = k`, but is this necessary at every step? For example, after I pick the first item, all `q_i` will be recalculated; should `sum_i p_i` still equal `k`, where k is now `k - 1`? – Hurricane Development Jun 24 '15 at 16:40
  • @HurricaneDevelopment Yes, it's necessary. There doesn't seem to be a nice way to calculate the `q_i`s. Certainly simple normalization is not the answer. – David Eisenstat Jun 24 '15 at 18:58
  • @DavidEisenstat I am trying to replicate the algorithm **With replacements**, but the results do not seem to match the input percentages; mind taking a look? It's too much to fit in a comment, so here is a pastebin: http://pastebin.com/S5jfaJyF – Hurricane Development Jun 24 '15 at 19:35
  • @HurricaneDevelopment I don't see the part where you checked that the `q_i`s summed to `1`. I'm guessing that they don't. – David Eisenstat Jun 24 '15 at 19:40
  • @DavidEisenstat Yea, you are correct, so `q_i` must sum to 1 and `p_i` must also sum to `k`? – Hurricane Development Jun 24 '15 at 19:48
  • @HurricaneDevelopment Yes, as I understand things. I don't think that both conditions will hold except in special cases, however. – David Eisenstat Jun 24 '15 at 19:51

Unless my math skills are a lot weaker than I think, the average chance of an element from list A in your example being found in list B should be 10/26 ≈ 0.38.
If you lower this chance for any object, there must be others with higher chances. Also, your probabilities from list A cannot work: they are too low; you could not fill your list / you don't have enough elements to pick from.

Assuming the above is correct (or correct enough), that would mean that in your list A the average weight would have to be the average chance of a random pick. That, in turn, means your probabilities in list A don't sum up to 100.

Unless I am completely wrong, that is...

Burki
  • You are not completely wrong, but unfortunately I have tried some of this and it did not work. Your 10/26 is very close; however, I believe that because there are no replacements the actual calc is `25_C_9 / 26_C_10`, which turns out to be `.3846`. Anyway, the idea is the same. And I read somewhere that the sum of all `p_i` must equal K: `26 * .3846 = 10`. I tried this initially 10,000 times with all `p_i = .3846` and it worked! But after altering the numbers while keeping the constraint `sum_i p_i = k`, it stops working. Please view my edit to see the results. – Hurricane Development Jun 24 '15 at 16:13