Random based on area

Question

I have an array of elements:

$arr = array(
  '0' => 265000, // Area
  '1' => 190000,
  '2' => 30000,
  '3' => 1300
);

I want to get random index based on the area (Array value). I need the area with big value be selected more frequently. How can I do this?

What I have now:

$random_idx = mt_rand(0, count($arr)-1);    
$selected_area = (object)$arr[$random_idx];

Thanks!

What does "based on the area" mean? It's not really clear what you're trying to do here. — Ryan Bemrose, May 13 '16 at 17:12
Are those values random weighting? Meaning you want the array to select index `0` 265000 times for every 1300 times it selects index `3`? — Ryan Bemrose, May 13 '16 at 17:23

Bob__ · Answer 1 · 2016-05-21T15:11:27.160

1. Repeted values

Let's suppose we have an array in which every value corresponds to the relative probability of its index. For example, given a coin, the possible outcomes of a toss are 50% tails and 50% heads. We can represent those probability with an array, like (I'll use PHP as this seems the language used by OP):

$coin = array(     
     'head' => 1,    
     'tails' => 1    
);

While the results of rolling two dice can be represented as:

$dice = array( '2' => 1, '3' => 2, '4' => 3, '5' => 4, '6' => 5, '7' => 6,
               '8' => 5, '9' => 4, '10' => 3, '11' => 2, '12' => 1
);

An easy way to pick a random key (index) with a probability proportional to the values of those arrays (and therefore consistent to the underlying model) is to create another array whose elements are the keys of the original one repeated as many times as indicated by the values and then return a random value. For example for the dice array:

$arr = array( 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, ...

Doing so, we are confident that each key will be picked up with the right relative probability. We can encapsulate all the logic in a class with a constructer which builds the helper array an a function that returns a random index using mt_rand():

class RandomKeyMultiple {
    private $pool = array();
    private $max_range;

    function __construct( $source ) {
        // build the look-up array
        foreach ( $source as $key => $value ) {
            for ( $i = 0; $i < $value; $i++ ) {
                $this->pool[] = $key;
            }
        }
        $this->max_range = count($this->pool) - 1;
    }

    function get_random_key() {
        $x = mt_rand(0, $this->max_range);

        return $this->pool[$x];     
    }
}

The usage is simple, just create an object of the class passing the source array and then each call of the function will return a random key:

$test = new RandomKeyMultiple($dice);
echo $test->get_random_key();

The problem is that OP's array contains big values and this results in a very big (but still manageable, even without dividing all the values by 100) array.

2. Steps

In general, discrete probability distribution may be more complicated, with float values that cannot be easily translated in number of repetitions.

Another way to solve the problem is to consider the values in the array as the misures of intervals that divide the global range of all possible values:

    +---------------------------+-----------------+-------+----+
    |                           |                 |       |    |
    |<---       265000      --->|<--   190000  -->|<30000>|1300| 
    |<-------            455000            ------>|            |
    |<----------              485000            --------->|    |
    |<----------------            486300        -------------->|

Then we can choose a random number between 0 and 486300 (the global range) and look up the right index (the odds of which would be proportional to the lenght of its segment, giving the correct probability distribution). Something like:

$x = mt_rand(0, 486300);
if ( $x < 265000 )
    return 0;
elseif ( $x < 455000 )
    return 1;
elseif ( $x < 485000 )
    return 2;
else
    return 3;

We can generalize the algorithm and encapsulate all the logic in a class (using an helper array to store the partial sums):

class RandomKey {
    private $steps = array();
    private $last_key;
    private $max_range;

    function __construct( $source ) {
        // sort in ascending order to partially avoid numerical issues
        asort($source);  

        // calculate the partial sums. Considering OP's array:
        //
        //   1300 ---->       0
        //  30000 ---->    1300
        // 190000 ---->   31300
        // 265000 ---->  221300  endind with $partial = 486300
        //
        $partial = 0;
        $temp = 0;
        foreach ( $source as $k => &$v ) {
            $temp = $v;
            $v = $partial;
            $partial += $temp;
        }

        // scale the steps to cover the entire mt_rand() range
        $factor = mt_getrandmax() / $partial;
        foreach ( $source as $k => &$v ) {
            $v *= $factor;
        }

        // Having the most probably outcomes first, minimizes the look-up of
        // the correct index
        $this->steps = array_reverse($source);

        // remove last element (don't needed during checks) but save the key
        end($this->steps);
        $this->last_key = key($this->steps);
        array_pop($this->steps);
    }

    function get_random_key() {
        $x = mt_rand();

        foreach ( $this->steps as $key => $value ) {
            if ( $x > $value ) {
                return $key;
            }
        }
        return $this->last_key;     
    }

}

Here or here there are live demos with some examples and helper functions to check the probability distribution of the keys.

For bigger arrays, a binary search to look-up the index may also be considered.

Oleg Mikhailov · Answer 2 · 2016-05-22T10:30:52.097

This solution is based on element's index, not on it's value. So we need the array to be ordered to always be sure that element with bigger value has bigger index.

Random index generator can now be represented as a linear dependency x = y:

(y)

a i     4             +
r n     3          +
r d     2       +
a e     1    +
y x     0 +
          0  1  2  3  4      

          r a n d o m
          n u m b e r   (x)

We need to generate indices non-linearly (bigger index - more probability):

a i     4                               +  +  +  +  +
r n     3                   +  +  +  +
r d     2          +  +  +
a e     1    +  + 
y x     0 +  
          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14

          r a n d o m
          n u m b e r

To find the range of x values for an array of length c we can calculate the sum of all numbers in range 0..c:

(c * (c + 1)) / 2;

To find x for any y let's solve quadratic equation

y ^ 2 + y - 2 * x = 0;

Having solved this we get

y = (sqrt(8 * x + 1) - 1) / 2;

Now let's put it all together:

$c = $count($arr);
$range = ($c * ($c + 1)) / 2;
$random_x = mt_rand(0, range);
$random_idx = floor((sqrt(8 * $random_x + 1) - 1) / 2);

This solution fits best for big arrays in terms of performance - it does not depend on the array size and type.

If I understand correctly, your solution requires us to first find a function describing the non-linear mapping between the random number and the index. How do you programmatically find such a function for a given array - via interpolation? Evaluating such an interpolation function might ruin the performance advantage of your approach though... — le_m, May 21 '16 at 23:37
General idea is: 1) defining certain algebraic function returning random generated index from certain range; 2) find the range of the function; 3) generate random number from the range; 4) pass the generated number into the function and obtain array index. — Oleg Mikhailov, May 22 '16 at 10:10
For this implementation the function `y = (floor(sqrt(8 * x + 1) - 1) / 2)` is used. It works just like it is drawn on the second graph. Any other function can be used, you only need to find a correct range for it. Just copy the four last lines from the answer - it should work for you. — Oleg Mikhailov, May 22 '16 at 10:23

score 0 · Answer 3 · edited May 23 '17 at 11:58

Your array describes a discrete probability distribution. Each array value ('area' or 'weight') relates to the probability of a discrete random variable taking a specific value from the range of array keys.

/**
 * Draw a pseudorandom sample from the given discrete probability distribution.
 * The input array values will be normalized and do not have to sum up to one.
 *
 * @param array $arr Array of samples => discrete probabilities (weights).
 * @return sample
 */
function draw_discrete_sample($arr) {
    $rand = mt_rand(0, array_sum($arr) - 1);
    foreach ($arr as $key => $weight) {
        if (($rand -= $weight) < 0) return $key;
    }
}

Replace the first line with $rand = mt_rand() / mt_getrandmax() * array_sum($arr); if you want to support non-integer weights / probabilities.

You might also want to have a look at similar questions asked here. If you are only interested in sampling a small set of known distributions, I recommend the analytic approach outlined by Oleg Mikhailov.

score 0 · Answer 4 · 2016-05-22T13:24:56.967

This problem is somewhat similar to the way operating systems can identify the next thread to run with lottery scheduling.

The idea is to assign each area a number of tickets depending on its size and number all those tickets. Depending on which random number was chosen you know which ticket won and thus the winning area.

First you will need to sum up all the areas and find a random number up to this total. Now you just iterate through your array and look for the first element whose summed up total to this point is larger than the random number.

Assuming you are looking for a solution in PHP:

function get_random_index($array) {
    // generate total
    $total = array_sum($array);
    // get a random number in the required range
    $random_number = rand(0, $total-1);
    // temporary sum needed to find the 'winning' area
    $temp_total = 0;
    // this variable helps us identify the winning area
    $current_area_index = 0;

    foreach ($array as $area) {
        // add the area to our temporary total
        $temp_total = $temp_total + $area;

        // check if we already have the right ticket
        if($temp_total > $random) {
            return $current_area_index;
        }
        else {
            // this area didn't win, so check the next one
            $current_area_index++;
        }
    }
}

Random based on area

4 Answers4