An efficient version/alternative to the alias method that samples without replacement

Question

I am writing a quiz/teaching game in HTML5/JS in which the player is offered a set of 10 questions from a larger mastery set. The game keeps track of the players' scores over time and is more likely to choose questions from the question list that the player is having trouble with.

To construct the probability distribution list, I employ the alias method as below, which selects an item in O(1) time while completely adhering to distribution:

function generate_random_question_selector() {
    // Generates a random selector function using the Alias Method
    // for discrete probability distributions (see
    // https://en.wikipedia.org/wiki/Alias_method for an explanation)
    var i = 0;
    var probabilities = [], aliases = [];
    var probSum = 0;

    /* ... Business logic to fill probabilities array ... */

    // Normalize all probabilities to average to 1
    // and categorize each probability as to where it fits
    // in that scale
    var probMultiplier = probabilities.length / probSum;
    var overFull = [], underFull = [];
    probabilities = probabilities.map(function(p, i) {
        var newP = p * probMultiplier;
        if (newP > 1) overFull.push(i);
        else if (newP < 1) underFull.push(i);
        else if (newP !== 1) {
            throw "Non-numerical value got into scores";
        }
        return newP;
    });
    overFull.sort();
    underFull.sort();

    // Process both queues by having each under-full entry
    // have the rest of its space occupied by the fullest
    // over-full entry, re-categorizing the over-full entry
    // as needed
    while (overFull.length > 0 || underFull.length > 0) {
        if (!(overFull.length > 0 && underFull.length > 0)) {
            // only reached due to rounding errors.
            // Just assign all the remaining probabilities to 1
            var notEmptyArray = overFull.length > 0 ? overFull : underFull;
            notEmptyArray.forEach(function(index) {
                probabilities[index] = 1;
            });
            break; // get out of the while loop
        }

        aliases[underFull[0]] = overFull[0];
        probabilities[overFull[0]] += probabilities[underFull[0]] - 1;
        underFull.shift();
        if (probabilities[overFull[0]] > 1) overFull.push(overFull.shift());
        else if (probabilities[overFull[0]] < 1) underFull.push(overFull.shift());
        else overFull.shift();
    }

    return function() {
        var index = Math.floor(Math.random() * probabilities.length);
        return Math.random() < probabilities[index] ? index : aliases[index];
    }
}

This method works really well, but part of my business specifications are that the questions do not repeat. I currently use a naive reroll technique to accomplish this, but it's obvious that this will break if less than 10 items are far more probable than the rest of the items:

var selectQuestion = generate_random_question_selector();   
var questionSet = [];
for (var i = 0; i < num_questions; i++) {
    var question_num;
    do {
        question_num = selectQuestion();
    } while (questionSet.indexOf(question_num) >= 0)
    questionSet.push(question_num);
}

What can be done to or about this method that would allow me to efficiently sample the questions without replacement?

Your `probabilities` array is only valid for the first call. On second call from your naive reroll, you don't want a previous question, meaning you theoretically want a new probabilities array. You could try to replace last line of code with `if (Math.random() < probabilities[index]) { probabilities[index] = 0;/*nullifying probability*/ return index; } else { return aliases[index]; }`. Or remove the item at index. But don't go with Fisher-Yates, it won't help, as your issue is your reroll. — Cœur, Dec 13 '15 at 03:39
The idea is that I'm trying to replace the reroll with something with more regular time bounds, which would also have to respect the variant probabilities (vanilla Fisher-Yates does not). Simply nullifying the probability does seem like a good idea, but it is a lot more involved than simply setting the probability to zero because I also have to change all the aliases. — TheHans255, Dec 13 '15 at 03:58

score 2 · Answer 1 · answered Dec 13 '15 at 07:00

The alias method is ill-suited to sampling without replacement, because each value is sampled using a different probability distribution, and computing (or updating) the alias table is O(n).

You need a data structure that can be updated more efficiently. For instance, you could build a search tree of all values (where each node stores the total weight of its subtree), which would permit sampling and updating the probability distribution in O(log n).

If we remove entries by setting their probablity to 0, this tree is never structurally modified, and can be encoded into an array.

Here is some code:

function prepare() {
    // index i is the parent of indices 2*i and 2*i+1
    // therefore, index 0 is unused, and index 1 the root of the tree
    var i;
    for (i = weights.length - 1; i > 1; i--) {
        weights[i >> 1] += weights[i];
    }
}

function sample() {
    var index = 1;
    var key = Math.random() * weights[index];

    for (;;) {
        var left = index << 1;
        var right = left + 1;
        leftWeight = weights[left] || 0;
        rightWeight = weights[right] || 0;

        if (key < leftWeight) {
            index = left;
        } else {
            key -= leftWeight;
            if (key < rightWeight) {
                index = right;
            } else {
                return index;
            }
        }
    }
}

function remove(index) {
    var left = index << 1;
    var right = left + 1;
    leftWeight = weights[left] || 0;
    rightWeight = weights[right] || 0;

    var w = weights[index] - leftWeight - rightWeight;
    while (index > 0) {
        weights[index] -= w;
        index = index >> 1;
    }
}

Test code:

function retrieve() {
    var index = sample();
    remove(index);
    console.log(index);
    console.log(weights);
}

weights = [0,1,2,3,4];
prepare();
console.log(weights);
retrieve();
retrieve();
retrieve();
retrieve();

I'm having trouble understanding your code. Could you please refer me to an official paper detailing your algorithm? — TheHans255, Jan 19 '16 at 01:02
Seeing that I came up with the algorithm in response to your question, I am not sure such a paper exists. I am also not sure it would merit a publication, it being a rather straightforward hybrid of sampling from a histogram, and a binary search tree to represent that histogram. — meriton, Jan 19 '16 at 02:01
Would you mind, then, reformulating the algorithm so that it stores the tree nodes as conventional tree node objects with data fields and left/right pointers? — TheHans255, Jan 19 '16 at 02:13

An efficient version/alternative to the alias method that samples without replacement

1 Answers1