
I have a collection of elements that I want to shuffle randomly, but every element has a different priority or weight. So the element with the bigger weight has to have a higher probability of ending up at the top of the result.

I have this Array:

elements = [
  { :id => "ID_1", :weight => 1 },
  { :id => "ID_2", :weight => 2 },
  { :id => "ID_3", :weight => 6 }
]

And I want to shuffle it so that the element with id "ID_3" has ~6 times the probability of being first as the element "ID_1", and ~3 times the probability of the element "ID_2".

Update

Clarification: once you have chosen the first position the other elements will fight for the rest positions using the same logic.

fguillen
  • What about the second, third etc positions? How are the probabilities there? – Programmer Person May 01 '15 at 05:30
  • @ProgrammerPerson once you have chosen the first position the other elements will _fight_ for the rest positions using the same logic. – fguillen May 06 '15 at 16:41
  • See this: http://stackoverflow.com/questions/3655430/selection-based-on-percentage-weighting. I believe this question has probably been asked at least 5 times before. One thing to note is the Alias method mentioned there. It is quite interesting. – Programmer Person May 07 '15 at 05:11

6 Answers


I can think of two approaches to solve it, though my gut tells me there should be a modification of Fisher-Yates that achieves it even better:

O(n*W) solution: (simple to program)

First approach: create duplicates according to the weight (same as your approach) and populate a new list. Now run a standard shuffle (Fisher-Yates) on this list. Then iterate the list and keep only the first occurrence of each element, discarding the later duplicates. This runs in O(n*W), where n is the number of elements in the list and W is the average weight (a pseudo-polynomial solution).
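
For concreteness, a minimal Ruby sketch of this first approach (the method name is mine; it is essentially the "replicate, shuffle, uniq" one-liner used in other answers below):

def duplicate_and_shuffle(elements)
  # Make :weight copies of each element, shuffle with Ruby's built-in
  # (Fisher-Yates) shuffle, then keep only the first occurrence of each.
  elements.flat_map { |h| [h] * h[:weight] }.shuffle.uniq
end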


O(nlogn) solution: (significantly harder to program)

Second approach would be to create a list of sum of weights of the elements:

sum[i] = weight[0] + ... + weight[i]

Now, draw a number between 0 and sum[n], and choose the first element whose cumulative sum is greater than or equal to this random number.
This will be the next element. Discard it, recreate the list, and repeat.

This runs in O(n^2*logn)
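
Before the tree enhancement below, a rough Ruby sketch of this list-based version (method and variable names are mine; weights are assumed to be positive integers):

def prefix_sum_shuffle(elements)
  remaining = elements.dup
  result = []
  until remaining.empty?
    total = remaining.sum { |h| h[:weight] }
    r = rand(total)                        # integer draw in 0...total
    cum = 0
    # first element whose running sum exceeds the draw
    idx = remaining.index { |h| r < (cum += h[:weight]) }
    result << remaining.delete_at(idx)
  end
  result
end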

It can be further enhanced by creating a binary tree rather than a list, where each node also stores the sum of the weights of its entire subtree.
Now, after choosing an element, find the matching node (the first one whose cumulative sum is higher than the randomly selected number), delete the node, and recalculate the weights on the path to the root.
This takes O(n) to create the tree, O(logn) to find the node at each step, and O(logn) to recalculate the sums. Repeat until the tree is exhausted, and you get an O(nlogn) solution.
The idea of this approach is very similar to Order Statistics Trees, but it uses the sum of weights rather than the number of descendants. The search and rebalancing after deletion are done similarly to an order statistics tree.


Explanation of constructing and using the binary tree.

Assume you have elements=[a,b,c,d,e,f,g,h,i,j,k,l,m] with weights=[1,2,3,1,2,3,1,2,3,1,2,3,1]

First construct an almost-full binary tree and populate the elements in it. Note that the tree is NOT a binary search tree, just a regular binary tree, so the order of elements does not matter - and we won't need to maintain it later on.

You will get something like the following tree:

[Figure: an almost-full binary tree holding the elements a..m, each node annotated with w and sw]

Legend: w - weight of that node, sw - sum of weights for the entire subtree.

Next, calculate the sum of weights for each subtree. Start from the leaves and calculate s.w = w. For every other node calculate s.w = w + left->s.w + right->s.w, filling the tree from the bottom up (post-order traversal).

[Figure: the same tree with s.w. filled in for every node; the root's s.w. is 25]

Building the tree, populating it, and calculating s.w. for each node is done in O(n).
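
As a small Ruby sketch (the Node struct and method name are mine, purely for illustration), the bottom-up s.w. computation could look like:

# w  = the node's own weight
# sw = sum of weights of the node's entire subtree
Node = Struct.new(:w, :sw, :left, :right)

# Post-order traversal: fill in sw for every node from the bottom up.
def compute_sw(node)
  return 0 if node.nil?
  node.sw = node.w + compute_sw(node.left) + compute_sw(node.right)
end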

Now, you iteratively need to choose a random number between 0 and the sum of weights (the s.w. value of the root, in our case 25). Let that number be r, and find the matching node for it.
Finding the matching node is done recursively:

if `r < root.left.sw`:
   go to the left son, and repeat.
else if `r < root.left.sw + root.w`:
   the node you are seeking is the root, choose it.
else:
   go to `root.right` with `r = r - root.left.sw - root.w`

Example, choosing r=10:

Is r < root.left.sw? Yes. Recursively invoke with r=10, root=B (left child).
Is r < root.left.sw? No. Is r < root.left.sw + root.w? No. Recursively invoke with r = 10-6-2 = 2 and root=E (right child).
Is r < root.left.sw? No. Is r < root.left.sw + root.w? Yes. Choose E as the next node.

This is done in O(h) = O(logn) per iteration.
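
Using the same hypothetical Node struct from above, the search step might be sketched as:

# Find the node matching a draw r in 0...root.sw.
def find_node(root, r)
  left_sw = root.left ? root.left.sw : 0
  if r < left_sw
    find_node(root.left, r)                      # go to the left son
  elsif r < left_sw + root.w
    root                                         # this node is the match
  else
    find_node(root.right, r - left_sw - root.w)  # go right, shifting r
  end
end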

Now, you need to delete that node and reset the weights of the tree.
One approach to deletion that keeps the tree at logarithmic height is similar to a binary heap: replace the chosen node with the bottom-rightmost node, remove that bottom-rightmost node from its old position, and recalculate the two paths going from the two involved nodes up to the root.

First switch:

[Figure: the tree after the chosen node and the bottom-rightmost node have been switched]

Then recalculate:

[Figure: the tree after recalculating s.w. along the two affected paths (nodes colored orange)]

Note that recalculation is needed only on two paths, each of depth at most O(logn) (the nodes colored orange in the picture), so deletion and recalculation are also O(logn).

Now, you got yourself a new binary tree, with the modified weights, and you are ready to choose the next candidate, until the tree is exhausted.

amit
  • Where you say, "Now, draw a number, between `0` to `weight[n]` and chose the first element greater/equals this random number", I believe you meant, "between `0` to `sum[n]`" and it may not be clear that by "first element" you are referring to `sum[i]`. Also, I would suggest `<=` rather than `>=`. If there were two elements with respective weights of 99 and 1, in that order, `sum[0] #=> 99; sum[1] #=> 100`, meaning that, with "first `>=`", the latter would never be selected before the former. – Cary Swoveland May 01 '15 at 01:23
  • I don't understand how a binary tree would be constructed for this problem, and how it would be used. Could you explain what it would look like for the array `elements` having just a few elements? – Cary Swoveland May 01 '15 at 03:34
  • @CarySwoveland Yes, it should have been `sum`, will fix soon. The idea of the binary tree is very similar to [order statistics tree](http://en.wikipedia.org/wiki/Order_statistic_tree), but rather than adding to each node the *size* of the sub tree, you add the sum of weights of the subtree. This allows you in similar manner to find the first element that is `>=x`, for some `x` in efficient manner. – amit May 01 '15 at 06:06
  • I have a laugh whenever I see "Fisher-Yates". The method is so obvious that's it's probably been "rediscovered" millions of times. Fisher and Yates made great contributions to probability and statistics, but the method that carries their name is not one of them. – Cary Swoveland May 03 '15 at 07:58
  • @CarySwoveland It's completely off topic, but that's how it is called. Similarly, the Maclaurin series is named after Maclaurin only because he used the Taylor series around 0 extensively (Taylor first developed the general series; Maclaurin just set a=0 and used it a lot), and we still call it the Maclaurin series, though his contribution to the series is insignificant. – amit May 03 '15 at 08:15
  • @CarySwoveland Anyway, I no longer believe fisher-yates can be modified here to work as efficiently (or better) as the suggested O(nlogn) solution I showed here, but I might be wrong. – amit May 03 '15 at 08:17
  • I expect Taylor was rolling in his grave when they began calling it "Taylor-Maclaurin". – Cary Swoveland May 03 '15 at 09:14
  • I compared the two standard ways of shuffling. You might find the results interesting. – Cary Swoveland May 03 '15 at 20:42
  • @CarySwoveland Pretty obviously approach 1 is much more inefficient. My 2nd approach however would beat both for a large enough array (and large W), as it is O(nlogn), compared to O(n^2logn) for the alternative (or so it seems from the "explanation"; I'm unfamiliar with Ruby). – amit May 03 '15 at 21:07
  • I still don't understand your 2nd approach. Can you at least show a small example of the binary tree? – Cary Swoveland May 03 '15 at 21:19
  • @CarySwoveland Added a detailed explanation of the tree algorithm. I think it is easy to see it is equivalent to the array solution, but significantly more efficient. Proving it is unbiased is by noting that at iteration `i`, for two nodes that are still unchosen with index j,k. the probability that `j` is chosen, given that one of them is chosen is `weight[j]/(weight[j] + weight[k])`. – amit May 04 '15 at 08:53
  • Thank you for a most excellent explanation of the tree algorithm! That is very cool indeed. Aside: what software did you use for the display of the trees? – Cary Swoveland May 04 '15 at 13:50
  • @CarySwoveland Ashamed to say manually drawn in power point. Saw no reason for fancy anything for such a small example. – amit May 04 '15 at 13:55
  • Implementation is much simpler if you don't remove any nodes, just set their weight to zero and update the parent nodes. It will also run in `O(n log(n))`. Please see my answer and code here: http://softwareengineering.stackexchange.com/a/344274/4254 – jbochi Mar 16 '17 at 13:44

I would shuffle the array as follows:

Code

def weighted_shuffle(array)
  arr = array.sort_by { |h| -h[:weight] }
  tot_wt = arr.reduce(0) { |t,h| t += h[:weight] }
  ndx_left = arr.each_index.to_a
  arr.size.times.with_object([]) do |_,a|
    cum = 0
    rn = (tot_wt>0) ? rand(tot_wt) : 0
    ndx = ndx_left.find { |i| rn <= (cum += arr[i][:weight]) }
    a << arr[ndx]
    tot_wt -= arr[ndx_left.delete(ndx)][:weight]
  end
end

Examples

elements = [
  { :id => "ID_1", :weight => 100 },
  { :id => "ID_2", :weight => 200 },
  { :id => "ID_3", :weight => 600 }
]

def display(arr,n)
  n.times.with_object([]) { |_,a|
    p weighted_shuffle(arr).map { |h| h[:id] } }
end

display(elements,10)
  ["ID_3", "ID_2", "ID_1"]
  ["ID_1", "ID_3", "ID_2"]
  ["ID_1", "ID_3", "ID_2"]
  ["ID_3", "ID_2", "ID_1"]
  ["ID_3", "ID_2", "ID_1"]
  ["ID_2", "ID_3", "ID_1"]
  ["ID_2", "ID_3", "ID_1"]
  ["ID_3", "ID_1", "ID_2"]
  ["ID_3", "ID_1", "ID_2"]
  ["ID_3", "ID_2", "ID_1"]

n = 10_000
pos = elements.each_index.with_object({}) { |i,pos| pos[i] = Hash.new(0) }
n.times { weighted_shuffle(elements).each_with_index { |h,i|
  pos[i][h[:id]] += 1 } }
pos.each { |_,h| h.each_key { |k| h[k] = (h[k]/n.to_f).round(3) } }
  #=> {0=>{"ID_3"=>0.661, "ID_2"=>0.224, "ID_1"=>0.115},
  #    1=>{"ID_2"=>0.472, "ID_3"=>0.278, "ID_1"=>0.251},
  #    2=>{"ID_1"=>0.635, "ID_2"=>0.304, "ID_3"=>0.061}}

This says that, of the 10,000 times weighted_shuffle was called, the first element selected was "ID_3" 66.1% of the time, "ID_2" 22.4% of the time and "ID_1" the remaining 11.5% of the time. "ID_2" was selected second 47.2% of the time, and so on.

Explanation

arr is the array of hashes to be shuffled. The shuffle is performed in arr.size steps. At each step I randomly draw an element of arr, without replacement, using the weights provided. If h[:weight] sums to tot over all elements h of arr that have not been previously selected, the probability of any one of those hashes h being selected is h[:weight]/tot. The selection at each step is done by finding the first cumulative weight c for which rand(tot) <= c. This last step is made more efficient by pre-sorting the elements of elements by decreasing weight, which is done in the first step of the method:

elements.sort_by { |h| -h[:weight] }
  #=> [{ :id => "ID_3", :weight => 600 },
  #    { :id => "ID_2", :weight => 200 },
  #    { :id => "ID_1", :weight => 100 }]

This is implemented using an array of indices of arr, called ndx_left, over which the iteration is performed. After a hash h at index i is selected, tot is updated by subtracting h[:weight] and i is deleted from ndx_left.

Variant

The following is a variant of the method above:

def weighted_shuffle_variant(array)
   arr = array.sort_by { |h| -h[:weight] }
   tot_wt = arr.reduce(0) { |t,h| t += h[:weight] }
   n = arr.size
   n.times.with_object([]) do |_,a|
     cum = 0
     rn = (tot_wt>0) ? rand(tot_wt) : 0
     h, ndx = arr.each_with_index.find { |h,_| rn <= (cum += h[:weight]) }
     a << h
     tot_wt -= h[:weight]
     arr[ndx] = arr.pop
   end
 end

Rather than maintaining an array of indices of elements in arr which have not yet been selected, arr is modified in place and reduced in size by one when each element is selected. If the element arr[i] is selected, the last element is copied to offset i and the last element of arr is removed:

arr[i] = arr.pop 

Benchmark

The approach of replicating each element h of elements h[:weight] times, shuffling then uniqifying the result is excruciatingly inefficient. If that's not obvious, here's a benchmark. I've compared my weighted_shuffle with @Mori's solution which is representative of the "replicate, shuffle, delete" approach:

def mori_shuffle(array)
  array.flat_map { |h| [h[:id]] * h[:weight] }.shuffle.uniq
end

require 'benchmark'

def test_em(nelements, ndigits)
  puts "\nelements.size=>#{nelements}, weights have #{ndigits} digits\n\n"
  mx = 10**ndigits
  elements = nelements.times.map { |i| { id: i, weight: rand(mx) } }
  Benchmark.bm(16) do |x|
    x.report("mori_shuffle")     { mori_shuffle(elements) }
    x.report("weighted_shuffle") { weighted_shuffle(elements) }
  end
end

elements.size=>3, weights have 1 digits

                      user     system      total        real
mori_shuffle      0.000000   0.000000   0.000000 (  0.000068)
weighted_shuffle  0.000000   0.000000   0.000000 (  0.000051)

elements.size=>3, weights have 2 digits

                      user     system      total        real
mori_shuffle      0.000000   0.000000   0.000000 (  0.000035)
weighted_shuffle  0.010000   0.000000   0.010000 (  0.000026)

elements.size=>3, weights have 3 digits

                      user     system      total        real
mori_shuffle      0.000000   0.000000   0.000000 (  0.000161)
weighted_shuffle  0.000000   0.000000   0.000000 (  0.000027)

elements.size=>3, weights have 4 digits

                      user     system      total        real
mori_shuffle      0.000000   0.000000   0.000000 (  0.000854)
weighted_shuffle  0.000000   0.000000   0.000000 (  0.000026)

elements.size=>20, weights have 2 digits

                      user     system      total        real
mori_shuffle      0.000000   0.000000   0.000000 (  0.000089)
weighted_shuffle  0.000000   0.000000   0.000000 (  0.000090)

elements.size=>20, weights have 3 digits

                      user     system      total        real
mori_shuffle      0.000000   0.000000   0.000000 (  0.000771)
weighted_shuffle  0.000000   0.000000   0.000000 (  0.000071)

elements.size=>20, weights have 4 digits

                      user     system      total        real
mori_shuffle      0.000000   0.000000   0.000000 (  0.005895)
weighted_shuffle  0.000000   0.000000   0.000000 (  0.000073)

elements.size=>100, weights have 2 digits

                      user     system      total        real
mori_shuffle      0.000000   0.000000   0.000000 (  0.000446)
weighted_shuffle  0.000000   0.000000   0.000000 (  0.000683)

elements.size=>100, weights have 3 digits

                      user     system      total        real
mori_shuffle      0.010000   0.000000   0.010000 (  0.003765)
weighted_shuffle  0.000000   0.000000   0.000000 (  0.000659)

elements.size=>100, weights have 4 digits

                      user     system      total        real
mori_shuffle      0.030000   0.010000   0.040000 (  0.034982)
weighted_shuffle  0.000000   0.000000   0.000000 (  0.000638)

elements.size=>100, weights have 5 digits

                      user     system      total        real
mori_shuffle      0.550000   0.040000   0.590000 (  0.593190)
weighted_shuffle  0.000000   0.000000   0.000000 (  0.000623)

elements.size=>100, weights have 6 digits

                      user     system      total        real
mori_shuffle      5.560000   0.380000   5.940000 (  5.944749)
weighted_shuffle  0.010000   0.000000   0.010000 (  0.000636)

Comparison of weighted_shuffle and weighted_shuffle_variant

Considering that the benchmark engine is all warmed up, I may as well compare the two methods I suggested. The results are similar, with weighted_shuffle having a consistent edge. Here are some typical results:

elements.size=>20, weights have 3 digits

                               user     system      total        real
weighted_shuffle           0.000000   0.000000   0.000000 (  0.000062)
weighted_shuffle_variant   0.000000   0.000000   0.000000 (  0.000108)

elements.size=>20, weights have 4 digits

                               user     system      total        real
weighted_shuffle           0.000000   0.000000   0.000000 (  0.000060)
weighted_shuffle_variant   0.000000   0.000000   0.000000 (  0.000089)

elements.size=>100, weights have 2 digits

                               user     system      total        real
weighted_shuffle           0.000000   0.000000   0.000000 (  0.000666)
weighted_shuffle_variant   0.000000   0.000000   0.000000 (  0.000871)

elements.size=>100, weights have 4 digits

                               user     system      total        real
weighted_shuffle           0.000000   0.000000   0.000000 (  0.000625)
weighted_shuffle_variant   0.000000   0.000000   0.000000 (  0.000803)

elements.size=>100, weights have 6 digits

                               user     system      total        real
weighted_shuffle           0.000000   0.000000   0.000000 (  0.000664)
weighted_shuffle_variant   0.000000   0.000000   0.000000 (  0.000773)

As compared to weighted_shuffle, weighted_shuffle_variant does not maintain an array of indices of elements of (a copy of) elements that have not yet been selected (a time-saver). Instead, it replaces the selected element in the array with the last element of the array and then pops the last element, so the size of the array decreases by one at each step. Unfortunately, that destroys the ordering of elements by decreasing weight. By contrast, weighted_shuffle maintains the optimization of considering elements in decreasing order of weight. On balance, keeping the elements ordered by decreasing weight appears to matter more than avoiding the index bookkeeping.

Cary Swoveland

Based on the @amit suggestion:

def self.random_suffle_with_weight(elements, &proc)
  consecutive_chain = []
  elements.each do |element|
    proc.call(element).times { consecutive_chain << element }
  end

  consecutive_chain.shuffle.uniq
end
fguillen

Weighted Random Sampling (2005; Efraimidis, Spirakis) provides a very elegant algorithm for this. The implementation is super simple and runs in O(n log(n)):

import random

def weighted_shuffle(items, weights):
    # Sort indices by the key random() ** (1 / weight), descending (Efraimidis & Spirakis).
    order = sorted(range(len(items)), key=lambda i: -(random.random() ** (1.0 / weights[i])))
    return [items[i] for i in order]
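
A rough Ruby equivalent, assuming the hash format from the question (the method name es_weighted_shuffle is mine, not part of the original answer):

# Key each element by rand ** (1.0 / weight); sorting by that key, descending,
# gives a weighted shuffle (Efraimidis & Spirakis).
def es_weighted_shuffle(elements)
  elements.sort_by { |h| -(rand ** (1.0 / h[:weight])) }
end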
jbochi

I have my solution but I think it can be improved:

module Utils
  def self.random_suffle_with_weight(elements, &proc)
    # Create a consecutive chain of element
    # on which every element is represented
    # as many times as its weight.
    consecutive_chain = []
    elements.each do |element|
      proc.call(element).times { consecutive_chain << element }
    end

    # Choose one element randomly from
    # the consecutive_chain and remove it for the next round,
    # until all elements have been chosen.
    shorted_elements = []
    while(shorted_elements.length < elements.length)
      random_index = Kernel.rand(consecutive_chain.length)
      selected_element = consecutive_chain[random_index]
      shorted_elements << selected_element
      consecutive_chain.delete(selected_element)
    end

    shorted_elements
  end
end

Test:

def test_random_suffle_with_weight
  element_1 = { :id => "ID_1", :weight => 10 }
  element_2 = { :id => "ID_2", :weight => 20 }
  element_3 = { :id => "ID_3", :weight => 60 }
  elements = [element_1, element_2, element_3]

  Kernel.expects(:rand).with(90).returns(11)
  Kernel.expects(:rand).with(70).returns(1)
  Kernel.expects(:rand).with(60).returns(50)

  assert_equal([element_2, element_1, element_3], Utils.random_suffle_with_weight(elements) { |e| e[:weight] })
end
fguillen
  • Not sure why you are reinventing the wheel for "choose an item at a time" after creating the modified list with duplicates. It would be better to just (1) shuffle the list with Fisher-Yates, (2) discard duplicates and keep only the first occurrence of each. I believe there should be an even better approach than this, since it runs in pseudo-polynomial time, depending on the weights. – amit Apr 30 '15 at 16:18
  • @amit I've taken your suggestion to build another answer ;) – fguillen Apr 30 '15 at 16:26
  • Rather than create multiple answers, add that content to this one, explain what and why you changed it, and delete the other one. Creating multiple answers is confusing to anyone else trying to solve the same problem you are. – the Tin Man Apr 30 '15 at 18:23
  • You can improve efficiency by first dividing each weight by the greatest common divisor of all weights. See my comment on @Mori's answer. – Cary Swoveland May 02 '15 at 18:52
  • @CarySwoveland Not sure what you mean; it looks like an implementation detail to me. What I had in mind (solution 1 in my answer, which basically extends the list so each element has as many dupes as its weight, then shuffles with Fisher-Yates, and then discards all duplicates, retaining only the FIRST occurrence of each element) does not rely on hashing. You can use hashing for the last step of removing dupes, but the point is to keep only the first element either way. Not sure how hashing would affect the correctness of the approach. – amit May 03 '15 at 07:36
  • @CarySwoveland I think you are referring to a shuffle that is based on `aux[i] = hash(arr[i])`, and then sort(arr) according to value of aux. This is suboptimal shuffling, fisher yates is better in both performance and correctness – amit May 03 '15 at 07:37
  • @amit, my apologies for the remark about duplicate hashes. I thought I had removed that. (I was referring to duplicate values of `array`, but of course there would never be any.) With regard to the use of `uniq`, as you say, that removes all but the first instance. If the one it kept was chosen randomly, I would expect the shuffle to be unbiased. It may be unbiased by removing the first, but I don't know that to be a fact. I would want a proof that it results in an unbiased shuffle before using it. – Cary Swoveland May 03 '15 at 07:55
  • @CarySwoveland In general terms: for every two elements with indices i, j, the probability of i appearing before j is weight[i]/(weight[i] + weight[j]). Proof: look at the shuffled sublist containing only the copies of arr[i] and arr[j]. There are weight[i] copies of arr[i] and weight[j] copies of arr[j], so the probability that the first element of that sublist is a copy of arr[i] is exactly weight[i]/(weight[i] + weight[j]). Since all occurrences of arr[j] come after that occurrence, arr[i] indeed comes before arr[j] with p = weight[i]/(weight[i] + weight[j]). – amit May 03 '15 at 08:12
  • @amit, I've seen the error of my ways; the method clearly generates unbiased shuffles. – Cary Swoveland May 03 '15 at 09:08
elements.flat_map { |h| [h[:id]] * h[:weight] }.shuffle.uniq
Mori
  • You should avoid code-only answers, especially when the 'algorithm' tag is mentioned, as it indicates the OP is not only interested in "getting it done", but also in "how is it done?" – amit Apr 30 '15 at 17:01
  • Don't toss a fish to someone, instead, teach them how to fish. – the Tin Man Apr 30 '15 at 18:21
  • You can reduce the size of the expanded array by first dividing each weight by the greatest common divisor of all weights, which you can obtain as follows: `wts = [650, 200, 350, 100]; gcd = wts.reduce(:gcd) #=> 50; wts.map { |w| w/gcd } #=> [13, 4, 7, 2]`. – Cary Swoveland May 03 '15 at 09:05
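
Combining that last comment with this answer, a hedged sketch of the GCD-reduced version (assuming the weights are positive integers) could be:

# Divide each weight by the GCD of all weights before expanding,
# so the intermediate array stays as small as possible.
gcd = elements.map { |h| h[:weight] }.reduce(:gcd)
elements.flat_map { |h| [h[:id]] * (h[:weight] / gcd) }.shuffle.uniq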