42

I've got a MySQL table with a bunch of entries in it, and a column called "Multiplier." The default (and most common) value for this column is 0, but it could be any number.

What I need to do is select a single entry from that table at random. However, the rows are weighted according to the number in the "Multiplier" column. A value of 0 means that it's not weighted at all. A value of 1 means that it's weighted twice as much, as if the entry were in the table twice. A value of 2 means that it's weighted three times as much, as if the entry were in the table three times.

I'm trying to modify what my developers have already given me, so sorry if the setup doesn't make a whole lot of sense. I could probably change it but want to keep as much of the existing table setup as possible.

I've been trying to figure out how to do this with SELECT and RAND(), but don't know how to do the weighting. Is it possible?

Peter O.
  • 32,158
  • 14
  • 82
  • 96
John
  • 421
  • 1
  • 4
  • 3
  • "As if the entry were in the table twice" sounds like a good starting point. Repeat each row `Multiplier` times, and do the random selection as you usually would. – bzlm Mar 10 '10 at 14:35
  • 1
    When you say "repeat each row" what do you mean? – John Mar 10 '10 at 14:47

11 Answers11

48

This guy asks the same question. He says the same as Frank, but the weightings don't come out right and in the comments someone suggests using ORDER BY -LOG(1.0 - RAND()) / Multiplier, which in my testing gave pretty much perfect results.

(If any mathematicians out there want to explain why this is correct, please enlighten me! But it works.)

The disadvantage would be that you couldn't set the weighting to 0 to temporarily disable an option, as you would end up dividing by zero. But you could always filter it out with a WHERE Multiplier > 0.

Ben
  • 515
  • 5
  • 18
limos
  • 1,526
  • 12
  • 13
  • 5
    `1 - RAND()` is equivalent to `RAND()`, which is (ideally) Uniform between 0 and 1. `-LOG(RAND())/weight` is Exponential with rate `weight`. Think of an Expo as the time from now until you get an email of a particular kind, and the rate is how fast each kind of email arrives. `LIMIT 1` just picks out the next email. – Ken Arnold May 09 '13 at 23:31
  • Brilliant! I modified this to weight towards an aggregate value from a related table. SELECT l.name, COUNT(l.id) FROM consignments c INNER JOIN locations l ON c.current_location_id = l.id GROUP BY l.id ORDER BY -LOG(RAND()) / COUNT(l.id) DESC – khany Nov 26 '14 at 09:17
  • 2
    Does this solution mean that the OP has to change their multiplier logic slightly? They originally said a multiplier of `0` indicates it is not weighted, but your solution means a multiplier of `0` is excluded from the result set. The OP would have to change their logic slightly so that a multiplier of `1` means not weighted, `2` means it's in the table twice, etc. This seems to make more sense anyway, but just wanted to confirm the change is necessary. – flyingL123 Jun 29 '15 at 13:01
  • 3
    @flyingL123 true, good point. Or they could replace `Multiplier` with `Multiplier + 1` – limos Jul 01 '15 at 06:36
  • 7
    @KenArnold As pointed out by a comment by Crissistian Leonte in the [same thread](http://www.kahunaburger.com/2008/10/13/selecting-random-weighted-records-from-mysql/) `1 - RAND()` is actually slightly 'cleaner' because it removes the tiny chance that you end up doing `LOG(0)` which returns `NULL`. This is because `RAND()` returns 0 <= x < 1. Both solutions should return comparable results however. – Arth Oct 14 '16 at 10:09
  • I needed just the opposite and simply added 1 to that :) `SELECT * FROM table ORDER BY 1.0 + LOG(1.0 - RAND()) / (Multiplier + 1)` – Klemen Tusar May 09 '18 at 14:14
17

For a much better performance (specially on big tables), first index the weight column and use this query:

SELECT * FROM tbl AS t1 JOIN (SELECT id FROM tbl ORDER BY -LOG(1-RAND())/weight LIMIT 10) AS t2 ON t1.id = t2.id

On 40MB table the usual query takes 1s on my i7 machine and this one takes 0.04s.

For explanation of why this is faster see MySQL select 10 random rows from 600K rows fast

Ali
  • 21,572
  • 15
  • 83
  • 95
  • 2
    Can you explain the significance of the subqueries? Why not `SELECT *` in the innermost subquery and do away with the other two? That then is just the form of the usual query. – concat Jan 10 '17 at 22:37
  • 4
    @concat That's because how SQL works: when you do an order on a big table it loads the whole data and then sorts according to the order by clause, but here the subquery only works on indexed data which are available in memory. see these tests: usual > https://i.stack.imgur.com/006Ym.jpg, subquery > https://i.stack.imgur.com/vXU8e.jpg the response time is highlighted. – Ali Jan 10 '17 at 23:41
  • I can now confirm, and while very unexpected, I think now I understand how this works. Thanks for showing me something new today! – concat Jan 11 '17 at 00:06
  • You're welcome, there are lots of unexpected things in SQL, this is one of them! – Ali Jan 11 '17 at 00:17
8

Don't use 0, 1 and 2 but 1, 2 and 3. Then you can use this value as a multiplier:

SELECT * FROM tablename ORDER BY (RAND() * Multiplier);
Frank Heikens
  • 117,544
  • 24
  • 142
  • 135
  • 2
    or just add 1: SELECT * FROM tablename ORDER BY (RAND() * (Multiplier+1)); – Silver Light Mar 10 '10 at 14:48
  • I thought of doing something like this, but I don't see how multiplying a random number by another number results in anything getting weighted. Also, how does it know which entry to take the multiplier value from? – John Mar 10 '10 at 14:52
  • @John: RAND() gives you a random number between 0 and 1. A bigger multiplier gives you bigger chance to end up with the biggest result. Sorting on this result makes sense. Do some tests with a large dataset and see the results. – Frank Heikens Mar 10 '10 at 15:32
  • add "limit 1" to the end of the select as well to just get a single row. This is not an efficient solution though, it'll be slow on big tables. – Peter N Lewis May 11 '10 at 06:09
  • 4
    This does not actually give the correct distribution (as I discovered by accident); limos's answer does. – Ken Arnold May 09 '13 at 23:32
  • 1
    This gives a horribly skewed distribution.. say there are 98 rows weighted 1 and 1 row weighted 2. RAND() will produce a number between 0 and 1, so 50% of the time the number will be > 0.5. For the row weighted 2, (RAND() * 2) will be greater than 1 50% of the time. This is larger than all (RAND() * 1) results, so the row weighted 2 will be selected at least 50% of the time. It should in fact be selected 2% of the time (2/100). – Arth Oct 14 '16 at 09:46
  • In fact because this is `ORDER BY ASC` not `DESC`, you've actually reduced the chance of the weighted rows being selected. – Arth Oct 14 '16 at 10:14
3

Well, I would put the logic of weights in PHP:

<?php
    $weight_array = array(0, 1, 1, 2, 2, 2);
    $multiplier = $weight_array[array_rand($weight_array)];
?>

and the query:

SELECT *
FROM `table`
WHERE Multiplier = $multiplier
ORDER BY RAND()
LIMIT 1

I think it will work :)

Silver Light
  • 44,202
  • 36
  • 123
  • 164
  • Interesting! The possible value for multiplier could theoretically be anything, but will probably go as high as 20. Wouldn't that make the array huge? Is that OK? – John Mar 10 '10 at 14:50
  • Well, you can make $weight_array dynamic, so that you dont have to type all the numbers by hand. Don't worry about resources - a thousand of int's is not much. – Silver Light Mar 10 '10 at 14:52
  • @John, then create the weight array dynamically with a for loop, by putting a 2nd for loop inside – TravisO Mar 10 '10 at 14:53
  • I'm not sure that this code do what I want it to do: Let's say I have 100 entries in the table: 98 have a multiplier of 0, 1 has a multiplier of 1 (counts as 2 entries), and 1 has a multiplier of 2 (counts as 3 entries). The chance of a 0-multiplier entry being chosen should be 98/103, of a 1-multiplier entry should be 2/103, and of a 2-multiplier entry should be 3/103. However, with your code the chances would be 1/6, 2/6, 3/6. Maybe I need to put every entry's ID into an array, with weighted entried entered multiple times, and then use array_rand? – John Mar 10 '10 at 15:28
  • You don't have to put each entry ID into an array. You could get a count by weight: 98 at 0, 1 at 1, 1 at 2. Put the offset position into the array and repeat (add it to the array again) according to the weight. So the array would contain the numbers 1 to 98 each appearing once, 99 appearing twice, and 100 appearing 3 times. Randomly pick a position from the array, sort your data by weight and take the item at the selected position. This would be more suitable for a larger data set. – Ali Gangji Mar 03 '18 at 00:44
3

While I realise this is an question on MySQL, the following may be useful for someone using SQLite3 which has subtly different implementations of RANDOM and LOG.

SELECT * FROM table ORDER BY (-LOG(abs(RANDOM() % 10000))/weight) LIMIT 1;

weight is a column in table containing integers (I've used 1-100 as the range in my table).

RANDOM() in SQLite produces numbers between -9.2E18 and +9.2E18 (see SQLite docs for more info). I used the modulo operator to get the range of numbers down a bit.

abs() will remove the negatives to avoid problems with LOG which only handles non-zero positive numbers.

LOG() is not actually present in a default install of SQLite3. I used the php SQLite3 CreateFunction call to use the php function in SQL. See the PHP docs for info on this.

fishter
  • 169
  • 1
  • 5
2
SELECT * FROM tablename ORDER BY -LOG(RAND()) / Multiplier;

Is the one which gives you the correct distribution.

SELECT * FROM tablename ORDER BY (RAND() * Multiplier);

Gives you the wrong distribution.

For example, there are two entries A and B in the table. A is with weight 100 while B is with weight 200. For the first one (exponential random variable), it gives you Pr(A winning) = 1/3 while the second one gives you 1/4, which is not correct. I wish I can show you the math. However I do not have enough rep to post relevant link.

agoodname
  • 29
  • 5
1
<?php
/**
 * Demonstration of weighted random selection of MySQL database.
 */
$conn = mysql_connect('localhost', 'root', '');

// prepare table and data.
mysql_select_db('test', $conn);
mysql_query("drop table if exists temp_wrs", $conn);
mysql_query("create table temp_wrs (
    id int not null auto_increment,
    val varchar(16),
    weight tinyint,
    upto smallint,
    primary key (id)
)", $conn);
$base_data = array(    // value-weight pair array.
    'A' => 5,
    'B' => 3,
    'C' => 2,
    'D' => 7,
    'E' => 6,
    'F' => 3,
    'G' => 5,
    'H' => 4
);
foreach($base_data as $val => $weight) {
    mysql_query("insert into temp_wrs (val, weight) values ('".$val."', ".$weight.")", $conn);
}

// calculate the sum of weight.
$rs = mysql_query('select sum(weight) as s from temp_wrs', $conn);
$row = mysql_fetch_assoc($rs);
$sum = $row['s'];
mysql_free_result($rs);

// update range based on their weight.
// each "upto" columns will set by sub-sum of weight.
mysql_query("update temp_wrs a, (
    select id, (select sum(weight) from temp_wrs where id <= i.id) as subsum from temp_wrs i 
) b
set a.upto = b.subsum
where a.id = b.id", $conn);

$result = array();
foreach($base_data as $val => $weight) {
    $result[$val] = 0;
}
// do weighted random select ($sum * $times) times.
$times = 100;
$loop_count = $sum * $times;
for($i = 0; $i < $loop_count; $i++) {
    $rand = rand(0, $sum-1);
    // select the row which $rand pointing.
    $rs = mysql_query('select * from temp_wrs where upto > '.$rand.' order by id limit 1', $conn);
    $row = mysql_fetch_assoc($rs);
    $result[$row['val']] += 1;
    mysql_free_result($rs);
}

// clean up.
mysql_query("drop table if exists temp_wrs");
mysql_close($conn);
?>
<table>
    <thead>
        <th>DATA</th>
        <th>WEIGHT</th>
        <th>ACTUALLY SELECTED<br />BY <?php echo $loop_count; ?> TIMES</th>
    </thead>
    <tbody>
    <?php foreach($base_data as $val => $weight) : ?>
        <tr>
            <th><?php echo $val; ?></th>
            <td><?php echo $weight; ?></td>
            <td><?php echo $result[$val]; ?></td>
        </tr>
    <?php endforeach; ?>
    <tbody>
</table>

if you want to select N rows...

  1. re-calculate the sum.
  2. reset range ("upto" column).
  3. select the row which $rand pointing.

previously selected rows should be excluded on each selection loop. where ... id not in (3, 5);

Gonzalo.-
  • 12,512
  • 5
  • 50
  • 82
sukhoi
  • 11
  • 3
  • Wouldn't this solution produce a substantial amount of overhead? I'm not sure how resource-intensive the creation of an entire table, manipulation of that table, then deletion would be on the system would be. Would an array of weighted values, dynamically generated, be simpler, less error-prone, and less resource-intensive? – Nathan Jun 25 '14 at 23:43
  • could be much improved by using window functions, if mysql has that. – Jasen Jun 26 '18 at 01:46
1

For others Googling this subject, I believe you can also do something like this:

SELECT strategy_id
FROM weighted_strategies AS t1 
WHERE (
   SELECT SUM(weight) 
   FROM weighted_strategies AS t2 
   WHERE t2.strategy_id<=t1.strategy_id
)>@RAND AND 
weight>0
LIMIT 1

The total sum of weights for all records must be n-1, and @RAND should be a random value between 0 and n-1 inclusive.

@RAND could be set in SQL or inserted as a integer value from the calling code.

The subselect will sum up all the preceeding records' weights, checking it it exceeds the random value supplied.

wally
  • 3,492
  • 25
  • 31
0

Whatever you do, it is giong to be terrible because it will involve: * Getting the total "weights" for all columns as ONE number (including applying the multiplier). * Getting a random number between 0 and that total. * Getting all entries and runing them along, deducting the weight from the random number and choosing the one entry when you run out of items.

In average you will run along half the table. Performance - unless the table is small, then do it outside mySQL in memory - will be SLOW.

TomTom
  • 61,059
  • 10
  • 88
  • 148
0

The result of the pseudo-code (rand(1, num) % rand(1, num)) will get more toward 0 and less toward num. Subtract the result from num to get the opposite.

So if my application language is PHP, it should look something like this:

$arr = mysql_fetch_array(mysql_query(
    'SELECT MAX(`Multiplier`) AS `max_mul` FROM tbl'
));
$MaxMul = $arr['max_mul']; // Holds the maximum value of the Multiplier column

$mul = $MaxMul - ( rand(1, $MaxMul) % rand(1, $MaxMul) );

mysql_query("SELECT * FROM tbl WHERE Multiplier=$mul ORDER BY RAND() LIMIT 1");

Explanation of the code above:

  1. Fetch the highest value in the Multiplier column
  2. calculate a random Multiplier value (weighted toward the maximum value in the Multiplier column)
  3. Fetch a random row which has that Multiplier value

It's also achievable merely by using MySQL.

Proving that the pseudo-code (rand(1, num) % rand(1, num)) will weight toward 0: Execute the following PHP code to see why (in this example, 16 is the highest number):

$v = array();

for($i=1; $i<=16; ++$i)
    for($k=1; $k<=16; ++$k)
        isset($v[$i % $k]) ? ++$v[$i % $k] : ($v[$i % $k] = 1);

foreach($v as $num => $times)
        echo '<div style="margin-left:', $times  ,'px">
              times: ',$times,' @ num = ', $num ,'</div>';
Dor
  • 7,344
  • 4
  • 32
  • 45
  • I'm racking my brain trying to understand what this code is doing, but I see some stuff there that I haven't seen before. Could you explain it in layman's terms? – John Mar 10 '10 at 15:13
  • Yes :) I've edited my post with explanation for the PHP code. – Dor Mar 10 '10 at 15:22
  • Looks good, but the majority of entries will have a multiplier of 0 and it doesn't look like this code will ever select them. – John Mar 10 '10 at 15:45
  • I can't see why not... You can assign to $mul the value of `( rand(1, $MaxMul) % rand(1, $MaxMul) )` – Dor Mar 10 '10 at 15:51
0

@ali 's answer works great but you can not control how much your result skews toward higher or lower weights, you can change multiplier but it's not a very dynamic approach.

i optimized the code by adding POWER(weight,skewIndex) instead of weight which makes higher weights to appear more with values more than 1 for skewIndex and appear less with values between 0 and 1.

SELECT * FROM tbl AS t1 JOIN (SELECT id FROM tbl ORDER BY -LOG(1-RAND())/POWER(weight,skewIndex) LIMIT 10) AS t2 ON t1.id = t2.id

you can analyze query results with

SELECT AVG(weight) FROM tbl AS t1 JOIN (SELECT id FROM tbl ORDER BY -LOG(1-RAND())/POWER(weight,skewIndex) LIMIT 10) AS t2 ON t1.id = t2.id

for example setting skewIndex to 3 gives me average of 78% while skewIndex of 1 gives average of 65%

Pooya Estakhri
  • 1,129
  • 1
  • 11
  • 25