
I have a table which looks like this:

id: primary key
content: varchar
weight: int

What I want to do is randomly select one row from this table, but taking into account the weight. For example, if I have 3 rows:

id, content, weight
1, "some content", 60
2, "other content", 40
3, "something", 100

The first row has a 30% chance of being selected, the second a 20% chance, and the third a 50% chance.

Is there a way to do that? If I have to execute 2 or 3 queries it's not a problem.

FWH
    See this question: http://stackoverflow.com/questions/58457/random-weighted-choice-in-t-sql – nickf Sep 09 '09 at 07:39

7 Answers

19

I think the simplest approach is actually weighted reservoir sampling:

SELECT
  id,
  -LOG(RAND()) / weight AS priority
FROM
  your_table
ORDER BY priority
LIMIT 1;

It's a great method that lets you choose M out of N elements, where each element's probability of being chosen is proportional to its weight. It works just as well when you only want one element. The method is described in this article. Note that they choose the biggest values of POW(RAND(), 1/weight), which is equivalent to choosing the smallest values of -LOG(RAND()) / weight.
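Outside SQL, the trick is easy to sanity-check. Here's a quick Python simulation (not part of the original answer) of the -LOG(RAND()) / weight ordering, using the question's example weights; the small `or 1e-12` guard avoids LOG(0), per the comment below:

```python
import math
import random
from collections import Counter

def pick_weighted(rows):
    """The row with the smallest -log(U)/weight wins, mirroring the SQL ORDER BY."""
    return min(rows, key=lambda r: -math.log(random.random() or 1e-12) / r["weight"])

random.seed(1)
rows = [
    {"id": 1, "weight": 60},
    {"id": 2, "weight": 40},
    {"id": 3, "weight": 100},
]
counts = Counter(pick_weighted(rows)["id"] for _ in range(100_000))
for r in rows:
    # Frequencies approach 30% / 20% / 50%, i.e. weight / 200
    print(r["id"], counts[r["id"]] / 100_000)
```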

user711413
    This is a wonderful answer! Thank you! Just adding my two cents: wouldn't it be more elegant to write log(1-rand()) for avoiding log(0) since random values probably are in [0,1[ (not checked however)? – Thomas Baruchel Jan 26 '21 at 15:10
  • This looks like a good method, but the distribution can be very skewed. I tried weights for several rows where all weights were either 67 or 33 (i.e. approximately 2/3 or 1/3) and in my instance all rows chosen had the higher weight. Not sure why. – JosephDoggie Feb 05 '21 at 13:29
3

This works in MSSQL, and I am sure it should be possible to change a couple of keywords to make it work in MySQL as well (maybe even more nicely):

SELECT      TOP 1 t.*
FROM        @Table t
INNER JOIN (SELECT      t.id, sum(tt.weight) AS cum_weight
            FROM        @Table t
            INNER JOIN  @Table tt ON  tt.id <= t.id
            GROUP BY    t.id) tc
        ON  tc.id = t.id,
           (SELECT  SUM(weight) AS total_weight FROM @Table) tt,
           (SELECT  RAND() AS rnd) r
WHERE       r.rnd * tt.total_weight <= tc.cum_weight
ORDER BY    t.id ASC

The idea is to compute a cumulative weight for each row (the first subselect), then find where RAND(), scaled by the total weight, falls in this cumulative range.
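For intuition, here is the same cumulative-weight walk sketched in Python (a simulation of the query's logic, not a translation of the SQL):

```python
import random

def pick_by_cumulative_weight(rows):
    """rows is a list of (id, weight) pairs. Accumulate weights and return
    the first id whose cumulative weight covers rnd * total_weight, which
    is what the WHERE clause above selects."""
    total = sum(w for _, w in rows)
    threshold = random.random() * total
    cum = 0
    for row_id, w in rows:
        cum += w
        if threshold <= cum:
            return row_id
    return rows[-1][0]  # guard against floating-point edge cases

rows = [(1, 60), (2, 40), (3, 100)]
print(pick_by_cumulative_weight(rows))
```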

van
3

I have tried van's solution and, although it works, it is not quick.

My Solution

The way that I am solving this problem is by maintaining a separate, linked table for the weighting. The basic table structure is similar to this:

CREATE TABLE `table1` (
  `id` int(11) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  `name` varchar(100),
  `weight` tinyint(4) NOT NULL DEFAULT '1'
);

CREATE TABLE `table1_weight` (
  `id` bigint(20) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  `table1_id` int(11) NOT NULL
);

If I have a record in table1 with a weight of 3, then I create 3 records in table1_weight, linked to table1 via the table1_id field. Whatever the value of weight is in table1, that's how many linked records I create in table1_weight.
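The effect is easy to model: table1_weight behaves like a list in which each id appears `weight` times, so a uniform pick over it is automatically weight-proportional. A small Python sketch of the idea (illustrative only, not the SQL):

```python
import random
from collections import Counter

# table1 maps id -> weight; table1_weight is simulated as a flat list
# holding each id repeated `weight` times, just like the linked records.
table1 = {1: 60, 2: 40, 3: 100}
table1_weight = [row_id for row_id, w in table1.items() for _ in range(w)]

random.seed(0)
counts = Counter(random.choice(table1_weight) for _ in range(100_000))
for row_id in table1:
    print(row_id, counts[row_id] / 100_000)  # approaches weight / 200
```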

Testing

On a dataset with 976 records in table1 with a total weight of 2031 (and hence 2031 records in table1_weight), I ran the following two queries:

  1. A version of van's solution

    SELECT t.*
    FROM table1 t
    INNER JOIN
      ( SELECT t.id,
           SUM(tt.weight) AS cum_weight
       FROM table1 t
       INNER JOIN table1 tt ON tt.id <= t.id
       GROUP BY t.id) tc ON tc.id = t.id,
      ( SELECT SUM(weight) AS total_weight
       FROM table1) tt,
      ( SELECT RAND() AS rnd) r
    WHERE r.rnd * tt.total_weight <= tc.cum_weight
    ORDER BY t.id ASC
    LIMIT 1
    
  2. Joining to a secondary table for the weighting

SELECT t.*
FROM table1 t
INNER JOIN table1_weight w
    ON w.table1_id = t.id
ORDER BY RAND()
LIMIT 1

SQL 1 consistently takes 0.4 seconds.

SQL 2 takes between 0.01 and 0.02 seconds.

Conclusion

If the speed of selection of a random, weighted record is not an issue, then the single table SQL suggested by van is fine and doesn't have the overhead of maintaining a separate table.

If, as in my case, a short selection time is critical, then I would recommend the two table method.

double-beep
2

A simple approach (avoiding joins or subqueries) is to just multiply the weight by a random number between 0 and 1 to produce a temporary weight to sort by:

SELECT t.*, RAND() * t.weight AS w 
FROM table t 
ORDER BY w DESC
LIMIT 1

To understand this, consider that RAND() * 2x will be a larger value than RAND() * x approximately two thirds of the time. Consequently, over time each row should be selected with a frequency that's proportional to its relative weight (eg. a row with weight 100 will be selected about 100 times more often than a row with weight 1, etc).

Update: this method doesn't in fact produce the correct distributions, so for now don't use it! (see the comments below). I think there should still be a simple method similar to the above that will work, but for now the more complex method below, involving joins, might be better. I'm leaving this answer up because: (a) there's relevant discussion in the comments below, and (b) if/when I get a chance, I'll try to fix it.
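For anyone wanting to verify the problem, here's a quick Python simulation of the ORDER BY RAND() * weight DESC approach using the 80/10/10 example from Daniel Papasian's comment; the first row wins far more often than the intended 80% of the time:

```python
import random
from collections import Counter

def pick_rand_times_weight(rows):
    """Mimics ORDER BY RAND() * weight DESC LIMIT 1: each row gets a fresh
    uniform draw multiplied by its weight, and the largest value wins."""
    return max(rows, key=lambda r: random.random() * r[1])[0]

random.seed(0)
rows = [("a", 80), ("b", 10), ("c", 10)]
counts = Counter(pick_rand_times_weight(rows) for _ in range(100_000))
print(counts["a"] / 100_000)  # well above the intended 0.80 (analytically ~0.92)
```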

Nick F
  • It works well when you choose from a low number of rows (2 is best). I need to randomly choose from 50 rows: one has a weight of 32, one a weight of 3, and 48 a weight of 1, for a total weight of 83. So my row of 32 should have a 38.6% chance of being chosen, but with this method it has 32 times more chance of being chosen than all those with a weight of 1. Is there a way to take the total weight into account? THANKS!! – fgcarto Aug 12 '13 at 12:20
  • Doesn't this work in your case? In your case, the chance of the row with a weight of 32 being chosen should be 32/83 (0.386, or 38.6%). The chance of a row with a weight of 1 being chosen should be 1/83 (0.012, or 1.2%). But since **32/83 = 32 * 1/83**, it's still the case that the thing with a weight of 32 should be chosen 32 times more often than a thing with a weight of 1! – Nick F Aug 13 '13 at 13:50
  • I may have made a mistake in my script, but I had 30+ times the row with a weight of 32 and one of the others once in a while. It was chosen 32 times more often than all the others. I ended up creating a temp table with the total weight, using it to express the weight as a percentage (SELECT id FROM near50, total_weight ORDER BY Random()*(1/(WEIGHT*100/total_weight.weight)) LIMIT 1). – fgcarto Aug 13 '13 at 16:07
  • I'm sorry, I don't understand the problem here. Surely it _should_ be chosen 32 times more often than the others? This is the intended behaviour of my query, but it also fits with what you say you expect: since **38.6 = 32 * 1.2** surely this is just another way of saying _exactly the same thing_! ie. if you expect something to happen 38.6% of the time, then _by definition_ you must expect it to happen about 32 times more often than something that happens 1.2% of the time. I can't see why your temp table is necessary. Please think through carefully & make sure there really is a problem here! – Nick F Aug 13 '13 at 16:37
  • I understand what you are saying. Of course it should have 32 times more chance of being picked than any other with a weight of 1. What I'm saying is that in my script, it was chosen 32 times more often than all the others combined. Over 1000 tests, I got something like 960 picks of the one with the weight of 32 and 40 for the rest. I should have picked it around 386 times. My comment was based on my observation. – fgcarto Aug 13 '13 at 18:28
  • Pretty sure this won't give you the expected distribution. Consider 3 rows, of weights 80, 10, and 10. We expect the first row to be picked 80% of the time and the others with equal probability, the other 20% of the time. If rand()*80 > 10, then we must select the first row. If rand()*80 is equally distributed between [0, 80] the odds of exceeding 10 are 69/81, which is 85%. It will be overrepresented. Even if I made some off-by-one errors here. – Daniel Papasian Apr 20 '14 at 01:19
  • As Daniel says, this doesn't give the expected distribution – Eric G May 06 '14 at 05:11
  • @DanielPapasian "rand()*80 > 10" I'm not sure how this is relevant to the query in the answer, which is seeking the highest of 3 weighted random values. The answer looks intuitively correct, but I'm not saying it is. I just don't see how your reasoning refutes it. (Also, `rand()` must be floating point for this to be fully accurate, so it's more like 70 and 80 than 69 and 81, to the extent that matters here.) – mahemoff Jul 25 '14 at 16:56
  • Yes, my answer does _look_ intuitively correct, but @DanielPapasian is right. The distributions produced by this method are not right. Thanks for the feedback. I've amended the answer accordingly, and if you can see how to improve the approach above, please feel free to edit my answer further! – Nick F Jul 27 '14 at 16:43
  • this works reasonably well for small numbers of rows but if there are 1000 rows weight 100 and 1000 with weight 99, the 99s will see very little action. I think use of a uniform random distribution is the problem. – Jasen Jun 25 '18 at 22:51
0

This one seems to work, but I'm not sure of the math behind it.

SELECT RAND() / t.weight AS w, t.* 
FROM table t 
WHERE t.weight > 0
ORDER BY 1
LIMIT 1

My guess at why it works: the ascending order looks for the smallest results, and dividing by the weight clusters the random results for higher-weight rows more tightly near zero.

I tested it (actually the same algorithm in postgresql) with 209000 queries over 3000 rows and the weight representation came out correct.

my input data:

select count(*),weight from t group by weight
 count | weight 
-------+--------
  1000 |     99
  1000 |     10
  1000 |    100
(3 rows)

my results:

jasen=# with g as ( select generate_series(1,209000) as i )
,r as (select (  select t.weight as w 
    FROM  t 
    WHERE t.weight > 0
    ORDER BY ( random() / t.weight ) + (g.i*0)  LIMIT 1 ) from g)

select r.w, count(*), r.w*1000 as expect from r group by r.w;

  w  | count | expect 
-----+-------+--------
  99 | 98978 |  99000
  10 | 10070 |  10000
 100 | 99952 | 100000
(3 rows)

The +(g.i*0) has no effect on the arithmetic result, but an external reference is required to force the planner to re-evaluate the sub-select for each of the 209K input rows produced in g.

Jasen
-1

Maybe this one:

SELECT * FROM <Table> T
JOIN (SELECT FLOOR(MAX(ID) * RAND()) AS ID FROM <Table>) AS x ON T.ID >= x.ID
LIMIT 1;

Or this one:

SELECT * FROM tablename
          WHERE somefield='something'
          ORDER BY RAND() LIMIT 1
Pavlo Svirin
  • You're ignoring the weights – records with higher weight should occur more frequently in the result. – Jasen Jun 25 '18 at 23:09
-4

I don't remember how to use RAND() in MySQL, but here is a working example for MSSQL:

SELECT TOP(1) (weight +RAND ()) r, id, content, weight FROM Table
ORDER BY 1 DESC

If TOP(1) is not applicable, just fetch the first record from the total result set.

Dewfy
  • 23,277
  • 13
  • 73
  • 121