3

When drawing in random from a set of values in succession, where a drawn value is allowed to be drawn again, a given value has (of course) a small chance of being drawn twice (or more) in immediate succession, but that causes an issue (for the purposes of a given application) and we would like to eliminate this chance. Any algorithmic ideas on how to do so (simple/efficient)?

Ideally we would like to set a threshold say as a percentage of the size of the data set:

Say the size of the set of values N=100, and the threshold T=10%, then if a given value is drawn in the current draw, it is guaranteed not to show up again in the next N*T=10 draws.

Obviously this restriction introduces bias in the random selection. We don't mind that a proposed algorithm introduces further bias into the randomness of selection, what really matters for this application is that the selection is just random enough to appear so for a human observer.

As an implementation detail, the values are stored as database records, so database table flags/values can be used, or maybe external memory structures. Answers about the abstract case are welcome too.

Edit:

I just hit this other SO question here, which has good overlap with my own. Going through the good points there.

Community
  • 1
  • 1
Basel Shishani
  • 7,735
  • 6
  • 50
  • 67
  • Cheats solution is just to always act as if T = 100% and just shuffle the available numbers and return those shuffled numbers in turn. – Damien_The_Unbeliever Oct 16 '13 at 10:31
  • Maintain a queue of outputted items. As a number is outputted, remove it from the list and push it in the queue. After 10 trials, pop it back to the list. After the 1st 10 trials, you will push one number and pop one number from the queue. – Abhishek Bansal Oct 16 '13 at 10:39
  • @Damien: It wouldn't guarantee non-repeat, as the next shuffle can give us the last drawn element in first position (or first T positons). So edges are always a problem. – Basel Shishani Oct 16 '13 at 11:48
  • @BaselShishani In this idea, you will shuffle the items only once and then just output them index wise by simply iterating over them. – Abhishek Bansal Oct 16 '13 at 12:03
  • So just repeat cycling through the same list you mean? Won't feel random after a while! – Basel Shishani Oct 16 '13 at 12:12
  • 1
    Why not just model how you would do it when physically drawing items from a bag. On draw, remove the item from the collection. Don't replace the item back into the collection until the required numbers of draws has passed. – mbeckish Oct 16 '13 at 13:52

5 Answers5

2

Here's an implementation that does the whole process in O(1) (for a single element) without any bias:

The idea is to treat the last K elements in the array A (which contains all the values) like a queue, we draw a value from the first N-k values in A, which is the random value, and swap it with an element in position N-Pointer, when Pointer represents the head of the queue, and it resets to 1 when it crosses K elements.

To eliminate any bias in the first K draws, the random value will be drawn between 1 and N-Pointer instead of N-k, so this virtual queue is growing in size at each draw until reaching the size of K (e.g. after 3 draws the number of possible values appear in A between indexes 1 and N-3, and the suspended values appear in indexes N-2 to N.

All operations are O(1) for drawing a single elemnent and there's no bias throughout the entire process.

void DrawNumbers(val[] A, int K)
{
    N = A.size;
    random Rnd = new random;
    int Drawn_Index;
    int Count_To_K = 1;
    int Pointer = K;

    while (stop_drawing_condition)
    {
        if (Count_To_K <= K)
        {
            Drawn_Index = Rnd.NextInteger(1, N-Pointer);
            Count_To_K++;
        }

        else
        {
            Drawn_Index = Rnd.NextInteger(1, N-K)
        }

        Print("drawn value is: " + A[Drawn_Index])

        Swap(A[Drawn_Index], A[N-Pointer])
        Pointer--;
        if (Pointer < 1) Pointer = K; 
    }
}

My previous suggestion, by using a list and an actual queue, is dependent on the remove method of the list, which I believe can be at best O(logN) by using an array to implement a self balancing binary tree, as the list has to have direct access to indexes.

void DrawNumbers(list N, int K)
{
    queue Suspended_Values = new queue;
    random Rnd = new random;
    int Drawn_Index;

    while (stop_drawing_condition)
    {
          if (Suspended_Values.count == K)
                N.add(Suspended_Value.Dequeue());

          Drawn_Index = Rnd.NextInteger(1, N.size) // random integer between 1 and the number of values in N

          Print("drawn value is: " + N[Drawn_Index]);          

          Suspended_Values.Enqueue(N[Drawn_Index]);
          N.Remove(Drawn_Index);
    }
}
Ron Teller
  • 1,880
  • 1
  • 12
  • 23
  • Do any hashmaps support O(1) random access (which will be required to get the element to remove, as far as I understand)? I know insert and remove are O(1) (expected), but not typically random access as well. – Bernhard Barker Oct 16 '13 at 14:02
  • @Dukeling I think you're right and that may be a problem, I thought of two other options: (1) using a regular array, in the first `k` calls, narrowing the array, which means copying all values one index to the side, which takes `O(n)`, after the first `k` calls it takes `O(1)` as you can insert the dequeue value to the index of the current selected value. (2) using a `O(logN)`-remove data structure implemented with an array like a self-balancing binary tree. got another idea maybe? – Ron Teller Oct 16 '13 at 14:28
  • 1
    @RonTeller: Why dont you simply use 2 arrays? Fill the first one (A) with N-K elements in random order, fill the second one (B) with K elements. Have a "current pointer" pB for B. Pick one element randomly from A (keep value for return of course) and swap it with pB mod K in B. Increment pB by one. Shouldn't that solve all problems? Could even be done with a single array easily. – igrimpe Oct 16 '13 at 14:54
  • @RonTeller A BST would work. I use one in the shuffled-based approach in my answer. – Bernhard Barker Oct 16 '13 at 15:20
  • @igrimpe that's a pretty good idea, though it has a small bias towards the elements you would initially fill B with, I made an adjustment to eliminate this bias, see my implementation. – Ron Teller Oct 17 '13 at 08:00
  • @RonTeller: I dont smell anything ;) I think that because of the initial shuffling there is no bias against the "pre-delayed" values. The current implementation is exactly what I thought about, though I doubt it would be "better" (regarding bias), it just saves from the inital shuffling. btw: Any good (free + Win32) tool to "measure" such stuff? I got ent.exe from fourmilab.ch but dont know how good it is. – igrimpe Oct 17 '13 at 08:32
  • @igrimpe i'll try to put it this way, all values has the same chance of being biased, i know it's a weird sentence, but think about the value that was initially put in the end of `B`, it has no chance of being selected in the first `K` calls even though it's not suspended – Ron Teller Oct 17 '13 at 09:04
  • To eliminate bias in the very first run, maybe do a limited shuffle in the last K section, i.e by swapping at random with N-K section. (Apologies if this has been mentioned by others, the threads are getting hard to follow!) – Basel Shishani Oct 18 '13 at 03:54
2

Say you have n items in your list, and you don't want any of the k last items to be selected.

Select at random from an array of size n-k, and use a queue of size k to stick the items you don't want to draw (adding to the front and removing from the back).

All operations are O(1).

---- clarification ----

Give n items, and a goal of not redrawing any of the last k draws, create an array and queue as follows.

  1. Create an array A of size n-k, and put n-k of your items in the list (chosen at random, or seeded however you like).

  2. Create a queue (linked list) Q and populate it with the remaining k items, again in random order or whatever order you like.

Now, each time you want to select an item at random:

  1. Choose a random index from your array, call this i.

  2. Give A[i] to whomever is asking for it, and add it to the front of Q.

  3. Remove the element from the back of Q, and store it in A[i].

Everything is O(1) after the array and linked list are created, which is a one-time O(n) operation.

Now, you might wonder, what do we do if we want to change n (i.e. add or remove an element).

Each time we add an element, we either want to grow the size of A or of Q, depending on our logic for deciding what k is (i.e. fixed value, fixed fraction of n, whatever...).

If Q increases then the result is trivial, we just append the new element to Q. In this case I'd probably append it to the end of Q so that it gets in play ASAP. You could also put it in A, kicking some element out of A and appending it to the end of Q.

If A increases, you can use a standard technique for increasing arrays in amortized constant time. E.g., each time A fills up, we double it in size, and keep track of the number of cells of A that are live. (look up 'Dynamic Arrays' in Wikipedia if this is unfamiliar).

Dave
  • 7,460
  • 3
  • 26
  • 39
  • Which type of list supports removing an index in `O(1)`? – Ron Teller Oct 17 '13 at 07:42
  • 1
    @RonTeller The underlying data structures are an array and a queue (the 'list' is just what you start with). @ Dave You may want to elaborate a bit, i.e. that you replace the element removed from the array by the element popped from the queue, or that you can achieve O(1) remove by swapping with the last element and decreasing the size. – Bernhard Barker Oct 17 '13 at 08:36
  • @Dukeling if you remove a item from the array at certain index, what do you do with the empty spot in the first K calls? this explanation is way to simplified – Ron Teller Oct 17 '13 at 09:01
  • @RonTeller For this you can do O(1) removal by swapping with the last element and decreasing the size. Admittedly this answer needs to be elaborated upon, but the basic idea is good. – Bernhard Barker Oct 17 '13 at 09:04
  • @Dukeling decreasing the size is not O(1), it's O(n) with an array. you can't get around this with this logic... – Ron Teller Oct 17 '13 at 09:07
  • @RonTeller If you just keep a variable to indicate the size rather than doing a reallocation (exactly as Java's `ArrayList` and C++'s `vector` does), decreasing the size would just involve modifying this variable, which can obviously be done in O(1). – Bernhard Barker Oct 17 '13 at 09:08
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/39404/discussion-between-ron-teller-and-dukeling) – Ron Teller Oct 17 '13 at 09:09
  • @RonTeller I added a clarifying note to the end of my answer. – Dave Oct 17 '13 at 13:30
2

I assume you have an array, A, that contains the items you want to draw. At each time period you randomly select an item from A.

You want to prevent any given item, i, from being drawn again within some k iterations.

Let's say that your threshold is 10% of A.

So create a queue, call it drawn, that can hold threshold items. Also create a hash table that contains the drawn items. Call the hash table hash.

Then:

do
{
    i = Get random item from A
    if (i in hash)
    {
        // we have drawn this item recently. Don't draw it.
        continue;
    }
    draw(i);
    if (drawn.count == k)
    {
        // remove oldest item from queue
        temp = drawn.dequeue();
        // and from the hash table
        hash.remove(temp);
    }
    // add new item to queue and hash table
    drawn.enqueue(i);
    hash.add(i);
} while (forever);

The hash table exists solely to increase lookup speed. You could do without the hash table if you're willing to do a sequential search of the queue to determine if an item has been drawn recently.

Jim Mischel
  • 131,090
  • 20
  • 188
  • 351
  • 1
    And if you are using the database tables directly as your data structure, then `A` is your current table of values, `drawn` is a new table you would create to keep track of which items are currently drawn from the collection and the order in which they were drawn, and `hash` is an index on `drawn`. – mbeckish Oct 16 '13 at 16:56
1

I'd put all "values" into a "list" of size N, then shuffle the list and retrieve values from the top of the list. Then you "insert" the retrieved value at a random position with any index >= N*T.

Unfortunately I'm not truly a math-guy :( So I simply tried it (in VB, so please take it as pseudocode ;) )

Public Class BiasedRandom

Private prng As New Random
Private offset As Integer
Private l As New List(Of Integer)

Public Sub New(ByVal size As Integer, ByVal threshold As Double)

    If threshold <= 0 OrElse threshold >= 1 OrElse size < 1 Then Throw New System.ArgumentException("Check your params!")
    offset = size * threshold
    ' initial fill
    For i = 0 To size - 1
        l.Add(i)
    Next
    ' shuffle "Algorithm p"
    For i = size - 1 To 1 Step -1
        Dim j = prng.Next(0, i + 1)
        Dim tmp = l(i)
        l(i) = l(j)
        l(j) = tmp
    Next

End Sub

Public Function NextValue() As Integer

    Dim tmp = l(0)
    l.RemoveAt(0)
    l.Insert(prng.Next(offset, l.Count + 1), tmp)
    Return tmp

End Function

End Class

Then a simple check:

Public Class Form1
Dim z As Integer = 10
Dim k As BiasedRandom

Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
    k = New BiasedRandom(z, 0.5)
End Sub

Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click

    Dim j(z - 1)
    For i = 1 To 10 * 1000 * 1000
        j(k.NextValue) += 1
    Next
    Stop
End Sub

End Class

And when I check out the distribution it looks okay enough for an unarmed eye ;)

EDIT: After thinking about RonTeller's argumentation, I have to admit that he is right. I don't think that there is a performance friendly way to achieve the wanted and to pertain a good (not more biased than required) random order. I came to the follwoing idea:

Given a list (array whatever) like this:

0123456789 ' not shuffled to make clear what I mean

We return the first element which is 0. This one must not come up again for 4 (as an example) more draws but we also want to avoid a strong bias. Why not simply put it to the end of the list and then shuffle the "tail" of the list, i.e. the last 6 elements?

1234695807

We now return the 1 and repeat the above steps.

2340519786

And so on and so on. Since removing and inserting is kind of unnecessary work, one could use a simple array and a "pointer" to the actual element. I have changed the code from above to give a sample. It's slower than the first one, but should avoid the mentioned bias.

Public Function NextValue() As Integer

    Static current As Integer = 0
    ' only shuffling a part of the list
    For i = current + l.Count - 1 To current + 1 + offset Step -1
        Dim j = prng.Next(current + offset, i + 1)
        Dim tmp = l(i Mod l.Count)
        l(i Mod l.Count) = l(j Mod l.Count)
        l(j Mod l.Count) = tmp
    Next
    current += 1

    Return l((current - 1) Mod l.Count)

End Function

EDIT 2:

Finally (hopefully), I think the solution is quite simple. The below code assumes that there is an array of N elements called TheArray which contains the elements in random order (could be rewritten to work with sorted array). The value DelaySize determines how long a value should be suspended after it has been drawn.

Public Function NextValue() As Integer

    Static current As Integer = 0

    Dim SelectIndex As Integer = prng.Next(0, TheArray.Count - DelaySize)
    Dim ReturnValue = TheArray(SelectIndex)
    TheArray(SelectIndex) = TheArray(TheArray.Count - 1 - current Mod DelaySize)
    TheArray(TheArray.Count - 1 - current Mod DelaySize) = ReturnValue
    current += 1
    Return ReturnValue

End Function
igrimpe
  • 1,775
  • 11
  • 12
  • don't know how much it matters to the OP, but this will generate a huge bias through out the entire process towards values that were initially shuffled to the beginning of the list. – Ron Teller Oct 16 '13 at 11:13
  • @Ron: I don't see the strong bias, as the values are constantly dispersed over a large section at the bottom. My issue would be though with the efficiency of insertion. – Basel Shishani Oct 16 '13 at 11:28
  • @BaselShishani think about the last value in the list after it was shuffled, it has only a chance of 1/N to be pushed further after a draw, while the first value on the list have N-1/N chance of being pushed. – Ron Teller Oct 16 '13 at 11:36
  • @Ron: He is doing the shuffle only once initially, then he always keeps drawing from the top repeatedly, he pops the drawn value at the top an re-inserts randomly at the bottom (1-T)*N section, which can also be after the last element. Are you thinking in these same terms? – Basel Shishani Oct 16 '13 at 12:06
  • @BaselShishani It **can** be after the last element, but in most cases it won't, unlike the the elements in the beginning. for example, think about a list containing only the numbers 1,2,3 and no suspension, let's say you shuffled the list and this was the order: 1,2,3. so 3 pops first, now you insert it before 2, before 1 or after 1, so even after being drawn, it has 2/3 chance of being picked again before 1. the bias gets even worse when the number of values grow. – Ron Teller Oct 16 '13 at 12:20
1

Set-based approach:

If the threshold is low (say below 40%), the suggested approach is:

  • Have a set and a queue of the last N*T generated values.
  • When generating a value, keep regenerating it until it's not contained in the set.
  • When pushing to the queue, pop the oldest value and remove it from the set.

Pseudo-code:

generateNextValue:
  // once we're generated more than N*T elements,
  //   we need to start removing old elements
  if queue.size >= N*T
    element = queue.pop
    set.remove(element)

  // keep trying to generate random values until it's not contained in the set
  do
    value = getRandomValue()
  while set.contains(value)

  set.add(value)
  queue.push(value)

  return value

If the threshold is high, you can just turn the above on its head:

  • Have the set represent all values not in the last N*T generated values.
  • Invert all set operations (replace all set adds with removes and vice versa and replace the contains with !contains).

Pseudo-code:

generateNextValue:
  if queue.size >= N*T
    element = queue.pop
    set.add(element)

  // we can now just get a random value from the set, as it contains all candidates,
  //   rather than generating random values until we find one that works
  value = getRandomValueFromSet()
  //do
  //  value = getRandomValue()
  //while !set.contains(value)

  set.remove(value)
  queue.push(value)

  return value

Shuffled-based approach: (somewhat more complicated that the above)

If the threshold is a high, the above may take long, as it could keep generating values that already exists.

In this case, some shuffle-based approach may be a better idea.

  • Shuffle the data.
  • Repeatedly process the first element.
  • When doing so, remove it and insert it back at a random position in the range [N*T, N].

Example:

Let's say N*T = 5 and all possible values are [1,2,3,4,5,6,7,8,9,10].

Then we first shuffle, giving us, let's say, [4,3,8,9,2,6,7,1,10,5].

Then we remove 4 and insert it back in some index in the range [5,10] (say at index 5).

Then we have [3,8,9,2,4,6,7,1,10,5].

And continue removing the next element and insert it back, as required.

Implementation:

An array is fine if we don't care about efficient a whole lot - to get one element will cost O(n) time.

To make this efficient we need to use an ordered data structure that supports efficient random position inserts and first position removals. The first thing that comes to mind is a (self-balancing) binary search tree, ordered by index.

We won't be storing the actual index, the index will be implicitly defined by the structure of the tree.

At each node we will have a count of children (+ 1 for itself) (which needs to be updated on insert / remove).

An insert can be done as follows: (ignoring the self-balancing part for the moment)

// calling function
insert(node, value)
  insert(node, N*T, value)

insert(node, offset, value)
  // node.left / node.right can be defined as 0 if the child doesn't exist
  leftCount = node.left.count - offset
  rightCount = node.right.count

  // Since we're here, it means we're inserting in this subtree,
  //   thus update the count
  node.count++

  // Nodes to the left are within N*T, so simply go right
  // leftCount is the difference between N*T and the number of nodes on the left,
  //   so this needs to be the new offset (and +1 for the current node)
  if leftCount < 0
    insert(node.right, -leftCount+1, value)
  else
    // generate a random number,
    //   on [0, leftCount), insert to the left
    //   on [leftCount, leftCount], insert at the current node
    //   on (leftCount, leftCount + rightCount], insert to the right
    sum = leftCount + rightCount + 1
    random = getRandomNumberInRange(0, sum)
    if random < leftCount
      insert(node.left, offset, value)
    else if random == leftCount
      // we don't actually want to update the count here
      node.count--
      newNode = new Node(value)
      newNode.count = node.count + 1
      // TODO: swap node and newNode's data so that node's parent will now point to newNode
      newNode.right = node
      newNode.left = null
    else
      insert(node.right, -leftCount+1, value)

To visualize inserting at the current node:

If we have something like:

    4
   /
  1
 / \
2   3

And we want to insert 5 where 1 is now, it will do this:

  4
 /
5
 \
  1
 / \
2   3

Note that when a red-black tree, for example, performs operations to keep itself balanced, none of these involve comparisons, so it doesn't need to know the order (i.e. index) of any already-inserted elements. But it will have to update the counts appropriately.

The overall efficiency will be O(log n) to get one element.

Bernhard Barker
  • 54,589
  • 14
  • 104
  • 138
  • Sorry I'm not clear on what the do loop in the middle is doing, is randomvalue() drawing from set? can you please modify your code by adding a return statement to show what's being generated at what point, and initial states of set, queue. Thnx. – Basel Shishani Oct 16 '13 at 11:52
  • @BaselShishani Added a return statement. `randomValue` (renamed to `getRandomValue`) gets returns a random possible value (ignoring any previously generated values). The do-while loop ensures that it's not the same as a recently generated value. – Bernhard Barker Oct 16 '13 at 12:06
  • @BaselShishani Updated my answer with an elaboration on the shuffle-based approach. – Bernhard Barker Oct 16 '13 at 13:27
  • If you save the value list in a data structure with `O(1)` remove (e.g hashmap), when you choose a random value, you remove it from the list (so there's no chance of reselecting unavailable values) and add it to the queue, when dequeueing, you add the value back to the list (`O(1)` as it doesn't matter where you add it), in total it's `O(1)` for each element. – Ron Teller Oct 16 '13 at 13:39
  • @RonTeller But if the threshold is 99%, 99% of the generated values will be in the set, thus it could take really long to find one that isn't. At a threshold of 10%, this won't be a problem, but above 40% the performance can start dropping and above 80% it can start becoming quite slow. – Bernhard Barker Oct 16 '13 at 13:47
  • @Dukeling values that are "suspended" are not in the set, you remove them from the set when they were drawn, and bring them back only when dequeued – Ron Teller Oct 16 '13 at 13:48
  • @RonTeller Oh, I think I get it now. You just reversed the meaning of my set. – Bernhard Barker Oct 16 '13 at 14:00
  • You can get constant time (and trivial implementation) by sticking the elements you don't want in a queue. E.g., given n elts, k that we don't want to repeat, select at random from an array of size n-k, move the selected element to the front of the queue, and replace it with the element from the back of the queue. – Dave Oct 16 '13 at 16:14
  • @DaveGalvin Yes, it does seem like I was overthinking this. – Bernhard Barker Oct 16 '13 at 16:38