Weighted random pick

Question

I have a set of items. I need to randomly pick one. The problem is that they each have a weight of 1-10. A weight of 2 means that the item is twice as likely to be picked than a weight of 1. A weight of 3 is three times as likely.

I currently fill an array with each item. If the weight is 3, I put three copies of the item in the array. Then, I pick a random item.

My method is fast, but uses a lot of memory. I am trying to think of a faster method, but nothing comes to mind. Anyone have a trick for this problem?

EDIT: My Code...

Apparently, I wasn't clear. I do not want to use (or improve) my code. This is what I did.

//Given an array $a where $a[0] is an item name and $a[1] is the weight from 1 to 100.
$b = array();
foreach($a as $t)
    $b = array_merge($b, array_fill(0,$t[1],$t));
$item = $b[array_rand($b)];

This required me to check every item in $a and uses max_weight/2*size of $a memory for the array. I wanted a COMPLETELY DIFFERENT algorithm.

Further, I asked this question in the middle of the night using a phone. Typing code on a phone is nearly impossible because those silly virtual keyboards simply suck. It auto-corrects everything, ruining any code I type.

An yet further, I woke up this morning with an entirely new algorithm that uses virtual no extra memory at all and does not require checking every item in the array. I posted it as an answer below.

Average case is about 100,000 items with average weight of 5. That means that I make a 500.000 element array every time I want to pick a random item. — kainaw, Apr 27 '15 at 03:48
Keep the items in the database, when you pick one randomly, pull only that one item out of the database. That way you don't need the items in an array. — kojow7, Apr 27 '15 at 03:50
I challenge one to find a more memory efficient and faster way than my answer (evil laugh). Basically you do the calculation on an array of weights. — Devon Bessemer, Apr 27 '15 at 05:11
i have one thats better, but i can't post it. its an array of weight intervals. — pala_, Apr 27 '15 at 05:14
Yeah, I don't see it as off-topic. He explained what he was doing even if he didn't provide code. I suppose he just has to put his original code and it can be re-opened. — Devon Bessemer, Apr 27 '15 at 05:18
It's not off-topic, but it is a multi-duplicate (e.g. [1](http://stackoverflow.com/questions/1761626/weighted-random-numbers), [2](http://stackoverflow.com/questions/4726281/picking-random-element-by-user-defined-weights?rq=1), etc). Should be reopened so it can be closed correctly. — Alex Celeste, Apr 27 '15 at 06:09
@kainaw I would like to offer a new, lean, competitve answer. But for me to ensure it is the leanest it can be, I need to know how this input array is being formed to begin with. Is it in a database table? a file? hardcoded array in the script itself? Please update your question with this information and I should be able to offer a solid answer. — mickmackusa, Apr 17 '17 at 08:39
@mickmackusa I could have pulled the original data from just about anywhere. There are multiple times that I need a specific distribution of random data. For example, I might need to assign blood pressures to a test population such that the distribution has a mean of 140 with a SD of 20 on the left of mean and 35 on the right of mean. I calculate that complete distribution and then randomly assign blood pressures to the population. — kainaw, Apr 19 '17 at 15:48

pala_ · Answer 1 · 2015-04-27T10:29:26.057

5

This ones your huckleberry.

  $arr = array(
    array("val" => "one", "weight" => 1),
    array("val" => "two", "weight" => 2),
    array("val" => "three", "weight" => 3),
    array("val" => "four", "weight" => 4)
  );

  $weight_sum = 0;
  foreach($arr as $val)
  {
    $weight_sum += $val['weight'];
  }

  $r = rand(1, $weight_sum);
  print "random value is $r\n";

  for($i = 0; $i < count($arr); $i++)
  {
    if($r <= $arr[$i]['weight'])
    {
      print "$r <= {$arr[$i]['weight']}, this is our match\n";
      print $arr[$i]['val'] . "\n";
      break;
    }
    else
    {
      print "$r > {$arr[$i]['weight']}, subtracting weight\n";
      $r -= $arr[$i]['weight'];
      print "new \$r is $r\n";
    }
  }

No need to generate arrays containing an item for every weight, no need to fill an array with n elements for a weight of n. Just generate a random number between 1 and total weight, then loop through the array until you find a weight less than your random number. If it isn't less than the number, subtract that weight from the random and continue.

Sample output:

# php wr.php
random value is 8
8 > 1, subtracting weight
new $r is 7
7 > 2, subtracting weight
new $r is 5
5 > 3, subtracting weight
new $r is 2
2 <= 4, this is our match
four

This should also support fractional weights.

modified version to use array keyed by weight, rather than by item

  $arr2 = array(
  );

  for($i = 0; $i <= 500000; $i++)
  {
    $weight = rand(1, 10);
    $num = rand(1, 1000);
    $arr2[$weight][] = $num;
  }

  $start = microtime(true);

  $weight_sum = 0;
  foreach($arr2 as $weight => $vals) {
    $weight_sum += $weight * count($vals);
  }

  print "weighted sum is $weight_sum\n";

  $r = rand(1, $weight_sum);
  print "random value is $r\n";
  $found = false;
  $elem = null;

  foreach($arr2 as $weight => $vals)
  {
    if($found) break;
    for($j = 0; $j < count($vals); $j ++)
    {
      if($r < $weight)
      {
        $elem = $vals[$j];
        $found = true;
        break;
      }
      else
      {
        $r -= $weight;
      }
    }
  }
  $end = microtime(true);

  print "random element is: $elem\n";
  print "total time is " . ($end - $start) . "\n";

With sample output:

# php wr2.php
weighted sum is 2751550
random value is 345713
random element is: 681
total time is 0.017189025878906

measurement is hardly scientific - and fluctuates depending on where in the array the element falls (obviously) but it seems fast enough for huge datasets.

edited Apr 27 '15 at 10:29

answered Apr 27 '15 at 09:11

pala_

8,901
1
15
32

Interesting approach, using about 1/2 of the memory as written and accurate, but is it fast? For this, you'd be doing up to 1 million additions/substractions as opposed to generating a random number. – Devon Bessemer Apr 27 '15 at 09:47
still doing fewer times through the loop, and computers are pretty good at addition – pala_ Apr 27 '15 at 09:49
Yes, it'd be an interesting benchmark. The reason I think mine is faster is because it's only looping through weights (10) as opposed to every item (500000) – Devon Bessemer Apr 27 '15 at 09:52
well i guess that depends on the input - but the above can be converted to handle an array keyed by weight easily enough. – pala_ Apr 27 '15 at 09:56
Yeah that would also reduce memory usage. – Devon Bessemer Apr 27 '15 at 09:56
I'd question your benchmark. the above runs in under half a second when run on a dataset of half a million possible values spread over 10 weights - to my measurements. (excluding time to generate the data) – pala_ Apr 27 '15 at 10:27
How long did it take for mine on the same processor? I ran it on an Atom dev box. Slow processor, but still completed mine in 0.001 seconds on the same generated array. – Devon Bessemer Apr 27 '15 at 10:29
Your new code is much more efficient, takes 0.5 seconds as opposed to 15 seconds. +1. I'd remove the top code all together. – Devon Bessemer Apr 27 '15 at 10:40
yeah. re-arranged the datastructure a little more. anyway your effort still cranks out at about the same as on your box - i'm not 100% convinced yet of the accuracy (still working through it) but its a random element....probably not a huge deal – pala_ Apr 27 '15 at 10:41
The bottom code should be pretty accurate. The top code won't be on disproportionate counts, but the bottom code solves that using proportional weighting based on the counts. – Devon Bessemer Apr 27 '15 at 10:44
However, yours would probably be faster with an outlier. For example, if there was a weight with one element using the bottom code, as written, it would use more memory and probably be much slower. – Devon Bessemer Apr 27 '15 at 10:45
the proportional weighting still rounds off and introduces a few areas for discrepancy. so it's a little fuzzier in that respect - eg, lets say there are 49919 entries for weight 10 (where weight = 1..10). That's about 9.98% of the weights, yours rounds that to 10%. still, op has a load of options to choose from now. up to him :) – pala_ Apr 27 '15 at 10:53
Thank you for the work here comparing the answer from Devon to the answer from pala_. In my generalized view, Devon's is similar to my solution (which I added to the question). The speed is dependent on how long it takes to create the big array. Small arrays are created extremely fast. Massive arrays slow down. The answer from pala_ is an interval solution. I previously tried a slight variation. I tried a similar approach. It was faster, but not as fast as pala_'s solution. Then, while sleeping, I had an idea for a new solution that I posted as a new answer. – kainaw Apr 27 '15 at 13:21
actually, this could be further improved if the weight total was calculated when the data was assembled - but obviously that depends on the scenario – pala_ Apr 27 '15 at 13:43

Devon Bessemer · Answer 2 · 2015-04-27T10:09:25.350

This way requires two random calculations but they should be faster and require about 1/4 of the memory but with some reduced accuracy if weights have disproportionate counts. (See Update for increased accuracy at the cost of some memory and processing)

Store a multidimensional array where each item is stored in the an array based on its weight:

$array[$weight][] = $item;
// example: Item with a weight of 5 would be $array[5][] = 'Item'

Generate a new array with the weights (1-10) appearing n times for n weight:

foreach($array as $n=>$null) {
  for ($i=1;$i<=$n;$i++) {
    $weights[] = $n;
  }
}

The above array would be something like: [ 1, 2, 2, 3, 3, 3, 4, 4, 4, 4 ... ]

First calculation: Get a random weight from the weighted array we just created

$weight = $weights[mt_rand(0, count($weights)-1)];

Second calculation: Get a random key from that weight array

$value = $array[$weight][mt_rand(0, count($array[$weight])-1)];

Why this works: You solve the weighted issue by using the weighted array of integers we created. Then you select randomly from that weighted group.

Update: Because of the possibility of disproportionate counts of items per weight, you could add another loop and array for the counts to increase accuracy.

foreach($array as $n=>$null) {
  $counts[$n] = count($array[$n]);
}

foreach($array as $n=>$null) {
  // Calculate proportionate weight (number of items in this weight opposed to minimum counted weight)
  $proportion = $n * ($counts[$n] / min($counts));
  for ($i=1; $i<=$proportion; $i++) {
    $weights[] = $n;
  }
}

What this does is if you have 2000 10's and 100 1's, it'll add 200 10's (20 * 10, 20 because it has 20x the count, and 10 because it is weighted 10) instead of 10 10's to make it proportionate to how many are in there opposed the minimum weight count. So to be accurate, instead of adding one for EVERY possible key, you are just being proportionate based on the MINIMUM count of weights.

What if some weight doesn't exists? E.g. If the origin set doesn't have item with weight `5`, and you picked `$weight=5`, the next step will generate error. Of course with 100000 items this scene is rare, but still. — Passerby, Apr 27 '15 at 06:04
Also, with a set of `[1,1,2,3,...10]` (in the sense of weight), OP's implementation should select each `1` with 1/56 probability, and the `10` with 10/56 probability. In your implementation, you would select each `1` with 1/55*1/2=1/110 probability, and the `10` with 10/55*1/1=10/55 probability. — Passerby, Apr 27 '15 at 06:13
@Passerby, that is a good point that I didn't think of. I guess in that sense, you would need to loop through the first level of $array to determine $n instead of a for loop. — Devon Bessemer, Apr 27 '15 at 06:45
@Passerby, I believe my edits fix accuracy issues fairly well. — Devon Bessemer, Apr 27 '15 at 07:27

kainaw · Answer 3 · 2015-04-27T16:34:43.007

I greatly appreciate the answers above. Please consider this answer, which does not require checking every item in the original array.

// Given $a as an array of items
// where $a[0] is the item name and $a[1] is the item weight.
// It is known that weights are integers from 1 to 100.
for($i=0; $i<sizeof($a); $i++) // Safeguard described below
{
    $item = $a[array_rand($a)];
    if(rand(1,100)<=$item[1]) break;
}

This algorithm only requires storage for two variables ($i and $item) as $a was already created before the algorithm kicked in. It does not require a massive array of duplicate items or an array of intervals.

In a best-case scenario, this algorithm will touch one item in the original array and be done. In a worst-case scenario, it will touch n items in an array of n items (not necessarily every item in the array as some may be touched more than once).

If there was no safeguard, this could run forever. The safeguard is there to stop the algorithm if it simply never picks an item. When the safeguard is triggered, the last item touched is the one selected. However, in millions of tests using random data sets of 100,000 items with random weights of 1 to 10 (changing rand(1,100) to rand(1,10) in my code), the safeguard was never hit.

I made histograms comparing the frequency of items selected among my original algorithm, the ones from answers above, and the one in this answer. The differences in frequencies are trivial - easy to attribute to variances in the random numbers.

EDIT... It is apparent to me that my algorithm may be combined with the algorithm pala_ posted, removing the need for a safeguard.

In pala_'s algorithm, a list is required, which I call an interval list. To simplify, you begin with a random_weight that is rather high. You step down the list of items and subtract the weight of each one until your random_weight falls to zero (or less). Then, the item you ended on is your item to return. There are variations on this interval algorithm that I've tested and pala_'s is a very good one. But, I wanted to avoid making a list. I wanted to use only the given weighted list and never touch all the items. The following algorithm merges my use of random jumping with pala_'s interval list. Instead of a list, I randomly jump around the list. I am guaranteed to get to zero eventually, so no safeguard is needed.

// Given $a as the weighted array (described above)
$weight = rand(1,100); // The bigger this is, the slower the algorithm runs.
while($weight>0)
{
    $item = $a[array_rand($a)];
    $weight-= $item[1];
}
// $item is the random item you want.

I wish I could select both pala_ and this answer as the correct answers.

As you say, in the worst case scenario it could (very unlikely) run forever - and with the safeguard, it is possible this will not return a value at all. Consider the simple example (and i admit i haven't touched probabilities in nearly 20 years). item 1, weight 30. item 2, weight 70. There's a 50% chance of either of them getting chosen, followed by either a 70% or 30% chance of the weight condition being met. Surely this means that there's a 35% chance of weight 70 being chosen, a 15% chance of weight 30 being chosen, leaving a 50% chance of having to choose again? — pala_, Apr 27 '15 at 13:36
If the safeguard is hit, the last value selected is the one returned. However, I haven't hit the safeguard yet with real data sets. As for your example, there is a 50% chance in each round of an item being chosen. That means that in the set of 2 items (which is too small for the intention of this data set), there is a 75% chance that an item will be selected based on weight and a 25% chance that an item will be selected at random. As the data set gets larger, the chance of hitting the safeguard shrinks. (continued...) — kainaw, Apr 27 '15 at 14:14
(...continued) Consider your example without a safeguard. There is a 50% chance of picking an item in the first round. There is a 75% chance of selecting an item after the second round. There is an 87.5% chance of selecting an item after the third round. There is a 93.75% chance of selecting an item after the fourth round. As you can see, if I have a data set of 100,000 items, will pick an item rather quickly and the chance of hitting the safeguard is very small. — kainaw, Apr 27 '15 at 14:17
Oh I have no doubt as the data grows the chance of failure is minimal, its actually worse on the smaller scale. — pala_, Apr 27 '15 at 14:20
I strongly agree. With a smaller data set, I would not use this algorithm. I only use this for extremely large data sets. I like your solution as well. If I alter it to greatly limit how many items I touch, I find that high weight items following a series of low weight items are favored - which is not the case if I don't try to limit how many items I touch. That is why I considered randomly jumping all over the list - which is where this algorithm came from. — kainaw, Apr 27 '15 at 14:24
I can't get either of your codes to output a selected value. "It is known that weights are integers from 1 to 100." Why is that? Your question doesn't say that. Can you provide an online demo with a sample array so I can see how your code performs? — mickmackusa, Apr 23 '17 at 23:00

Kancho Iliev · Answer 4 · 2015-04-27T04:55:50.077

ere is my offer in case I've understand you right. I offer you take a look and if there are some question I'll explain. Some words in advance:

My sample is with only 3 stages of weight - to be clear - With outer while I'm simulating your main loop - I count only to 100. - The array must to be init with one set of initial numbers as shown in my sample. - In every pass of main loop I get only one random value and I'm keeping the weight at all.

<?php
$array=array(
    0=>array('item' => 'A', 'weight' => 1),
    1=>array('item' => 'B', 'weight' => 2),
    2=>array('item' => 'C', 'weight' => 3),
);
$etalon_weights=array(1,2,3);
$current_weights=array(0,0,0);
$ii=0;
while($ii<100){ // Simulates your main loop
    // Randomisation cycle
    if($current_weights==$etalon_weights){
        $current_weights=array(0,0,0);
    }
    $ft=true;
    while($ft){
        $curindex=rand(0,(count($array)-1));
        $cur=$array[$curindex];
        if($current_weights[$cur['weight']-1]<$etalon_weights[$cur['weight']-1]){
            echo $cur['item'];
            $array[]=$cur;
            $current_weights[$cur['weight']-1]++;
            $ft=false;
        }
    }
    $ii++;
}
?>

I'm refusing of my answer, because @Devon's way is fastest and simplest. — Kancho Iliev, Apr 27 '15 at 06:33

score 0 · Answer 5 · answered Apr 27 '15 at 04:43

I'm not sure if this is "faster", but I think it may be more "balance"d between memory usage and speed.

The thought is to transform your current implementation (500000 items array) into an equal-length array (100000 items), with the lowest "origin" position as key, and origin index as value:

<?php
$set=[["a",3],["b",5]];
$current_implementation=["a","a","a","b","b","b","b","b"];
// 0=>0 means the lowest "position" 0
// points to 0 in the set;
// 3=>1 means the lowest "position" 3
// points to 1 in the set;
$my_implementation=[0=>0,3=>1];

And then randomly picks a number between 0 and highest "origin" position:

// 3 is the lowest position of the last element ("b")
// and 5 the weight of that last element
$my_implemention_pick=mt_rand(0,3+5-1);

Full code:

<?php
function randomPickByWeight(array $set)
{
    $low=0;
    $high=0;
    $candidates=[];
    foreach($set as $key=>$item)
    {
        $candidates[$high]=$key;
        $high+=$item["weight"];
    }
    $pick=mt_rand($low,$high-1);
    while(!array_key_exists($pick,$candidates))
    {
        $pick--;
    }
    return $set[$candidates[$pick]];
}
$cache=[];
for($i=0;$i<100000;$i++)
{
    $cache[]=["item"=>"item {$i}","weight"=>mt_rand(1,10)];
}
$time=time();
for($i=0;$i<100;$i++)
{
    print_r(randomPickByWeight($cache));
}
$time=time()-$time;
var_dump($time);

3v4l.org demo
3v4l.org have some time limitation on codes, so the demo didn't finished. On my laptop the above demo finished in 10 seconds (i7-4700 HQ)

mickmackusa · Answer 6 · 2017-04-23T15:27:20.323

I'll use this input array for my explanation:

$values_and_weights=array(
    "one"=>1,
    "two"=>8,
    "three"=>10,
    "four"=>4,
    "five"=>3,
    "six"=>10
);

The simple version isn't going to work for you because your array is so large. It requires no array modification but may need to iterate the entire array, and that's a deal breaker.

/*$pick=mt_rand(1,array_sum($values_and_weights));
$x=0;
foreach($values_and_weights as $val=>$wgt){
    if(($x+=$wgt)>=$pick){
        echo "$val";
        break;
    }
}*/

For your case, re-structuring the array will offer great benefits. The cost in memory for generating a new array will be increasingly justified as:

array size increases and
number of selections increases.

The new array requires the replacement of "weight" with a "limit" for each value by adding the previous element's weight to the current element's weight.

Then flip the array so that the limits are the array keys and the values are the array values.

The selection logic is: the selected value will have the lowest limit that is >= $pick.

// Declare new array using array_walk one-liner:
array_walk($values_and_weights,function($v,$k)use(&$limits_and_values,&$x){$limits_and_values[$x+=$v]=$k;});

//Alternative declaration method - 4-liner, foreach() loop:
/*$x=0;
foreach($values_and_weights as $val=>$wgt){
    $limits_and_values[$x+=$wgt]=$val;
}*/
var_export($limits_and_values);

$limits_and_values looks like this:

array (
  1 => 'one',
  9 => 'two',
  19 => 'three',
  23 => 'four',
  26 => 'five',
  36 => 'six',
)

Now to generate the random $pick and select the value:

// $x (from walk/loop) is the same as writing: end($limits_and_values); $x=key($limits_and_values);
$pick=mt_rand(1,$x);  // pull random integer between 1 and highest limit/key
while(!isset($limits_and_values[$pick])){++$pick;}  // smallest possible loop to find key
echo $limits_and_values[$pick];  // this is your random (weighted) value

This approach is brilliant because isset() is very fast and the maximum number of isset() calls in the while loop can only be as many as the largest weight (not to be confused with limit) in the array.

FOR YOUR CASE, THIS APPROACH WILL FIND THE VALUE IN 10 ITERATIONS OR LESS!

Here is my Demo that will accept a weighted array (like $values_and_weights), then in just four lines:

Restructure the array,
Generate a random number,
Find the correct value, and
Display it.

@pala_ since our speed tests only measure the time to select the random value and not the array creation itself, I believe my method is 100's (if not thousands) of times faster than yours. Can you please look over my method and tell me what you think? https://3v4l.org/FnTBj — mickmackusa, Apr 24 '17 at 02:40

Weighted random pick

6 Answers6