26

Example input:

SELECT * FROM test;
 id | percent   
----+----------
  1 | 50 
  2 | 35   
  3 | 15   
(3 rows)

How would you write such a query that, on average, 50% of the time I would get the row with id=1, 35% of the time the row with id=2, and 15% of the time the row with id=3?

I tried something like SELECT id FROM test ORDER BY percent * random() DESC LIMIT 1, but it gives wrong results. After 10,000 runs I get a distribution like {1=6293, 2=3302, 3=405}, but I expected it to be close to {1=5000, 2=3500, 3=1500}.

Any ideas?

Craig Ringer
Oleg Golovanov
    What do you mean by wrong results? – Clodoaldo Neto Oct 23 '12 at 22:29
  • @Clodoaldo, after 10k runs of the above query I get these results (id to count): {1=6293, 2=3302, 3=405}, but I expect them to be close to {1=5000, 2=3500, 3=1500}. – Oleg Golovanov Oct 23 '12 at 22:46
  • @OlegGolovanov OK, so the query works, but the distribution is wrong. – Craig Ringer Oct 23 '12 at 22:59
  • Very interesting problem. Thanks for the question. In future it's worth being more specific about things like *why* something "doesn't work" or has "wrong" results, but otherwise ... good brain food, thanks. – Craig Ringer Oct 23 '12 at 23:53

7 Answers

29

This should do the trick:

WITH CTE AS (
    SELECT random() * (SELECT SUM(percent) FROM YOUR_TABLE) R
)
SELECT *
FROM (
    SELECT id, SUM(percent) OVER (ORDER BY id) S, R
    FROM YOUR_TABLE CROSS JOIN CTE
) Q
WHERE S >= R
ORDER BY id
LIMIT 1;

The sub-query Q gives the following result:

 id |  S
----+-----
  1 |  50
  2 |  85
  3 | 100

We then simply generate a random number in the range [0, 100) and pick the first row whose running sum is at or beyond that number (the WHERE clause). We use a common table expression (WITH) to ensure the random number is calculated only once.

BTW, the SELECT SUM(percent) FROM YOUR_TABLE allows you to have any weights in percent - they don't strictly need to be percentages (i.e. add up to 100).

[SQL Fiddle]
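
As a quick sanity check (my own sketch, not part of the original answer; the function name pick_weighted_id is made up), you can wrap the query in a VOLATILE SQL function so each call re-runs it against the question's test table, then draw 10,000 times and compare the counts to the weights:

CREATE OR REPLACE FUNCTION pick_weighted_id() RETURNS integer AS $$
    -- Same technique as above: one random draw per call, scaled to the weight total.
    WITH cte AS (
        SELECT random() * (SELECT SUM(percent) FROM test) AS r
    )
    SELECT id
    FROM (
        SELECT id, SUM(percent) OVER (ORDER BY id) AS s, r
        FROM test CROSS JOIN cte
    ) q
    WHERE s >= r
    ORDER BY id
    LIMIT 1;
$$ LANGUAGE sql VOLATILE;

-- VOLATILE forces a fresh execution (and a fresh random value) for every row.
SELECT id, count(*) AS picks
FROM (SELECT pick_weighted_id() AS id FROM generate_series(1, 10000)) draws
GROUP BY id
ORDER BY id;

With the question's data the counts should come out near 5,000 / 3,500 / 1,500.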

Branko Dimitrijevic
  • ... but it doesn't; it produces a *different* wrong distribution. See http://sqlfiddle.com/#!12/b67b6/2 – Craig Ringer Oct 23 '12 at 23:16
  • @CraigRinger Yes, the problem was probably in repeated generation of the random number. By moving it to common table expression, it is generated only once, giving a [much nicer result](http://sqlfiddle.com/#!12/d2a88/1). – Branko Dimitrijevic Oct 23 '12 at 23:40
  • That's a nicer, faster query than what I wrote; we took the same approach to solving the problem but your solution is a heck of a lot more efficient than using nested windows to calculate a weighted range like I did. – Craig Ringer Oct 23 '12 at 23:50
  • This probably doesn't need to be stated, but the CTE could be avoided if you know beforehand what the sum of your probabilities will be. For instance, if all rows' percent columns always add up to 100%, then we can take out the CTE, remove the cross join, and replace `where S >= R` with `where S >= random() * 100`. – John Fawcett Mar 24 '15 at 20:49
  • @JohnFawcett Unfortunately, that wouldn't work, since `random()` would get evaluated (and produce a different value) for each row. But we want it to produce one value, and then pick a row based on that one value. Please see the history of the answer - I got it wrong the first time exactly because of multiple generation of random values until Craig Ringer spotted the problem. – Branko Dimitrijevic Mar 24 '15 at 21:28
  • @BrankoDimitrijevic ah - then following what I was initially saying (removing the CTE), could we make the inner query column list be: `id, sum(percent) over(order by id) S, random() R`? – John Fawcett Mar 26 '15 at 17:28
  • @JohnFawcett I don't believe so, because of the same problem. We need to search through ALL rows (and return one row) based on SAME random value. Your proposal would cause a search on different random values. – Branko Dimitrijevic Mar 27 '15 at 08:42
  • @BrankoDimitrijevic you're absolutely right. I don't know why I didn't try it `select id, random() from some_random_table` returns a different `random()` value for each row. CTE it is. Thanks! – John Fawcett Mar 27 '15 at 17:28
  • Why order by id: It would seem you don't need any ordering, but if you did want to order it, you should order it by percent, as if somehow the percents are not in order the results will be skewed. – SamGoody Jun 21 '16 at 09:54
  • @SamGoody Because the same order was used in the window function SUM. It doesn't really matter what order is used, as long as **no** two elements are in the same "position", and both the window function and the "cut-off" (at the end of the query) use the same order. In principle, any key could have been used, but not `percent` since it's not a key and two rows may share the same percent, skewing the result to whatever row the DBMS prefers to return first. – Branko Dimitrijevic Jun 21 '16 at 11:09
  • @Branko I meant S, not percent, my bad. S must be unique and is guaranteed to be in the right order (and logically, it makes sense to sort on S). EDIT: Unless the percentage is zero, in which case 2 users could match (the latter having 0%) and if we sorted by S a user with a 0% chance could win. Whereas if you sort by ID both times, that cannot happen, since a 0 would always match the user before him. – SamGoody Jun 22 '16 at 08:22
  • @SamGoody `S` is a running total. It follows whatever order was specified in `SUM ... ORDER BY ...`. – Branko Dimitrijevic Jun 23 '16 at 09:01
  • Just to overcomplicate this; is there a reasonable way to do this for a table representing several weighted probabilities related to another table. e.g., rather than just `id` and `percent` columns, we also have, say, `user_id`, and we want to get a random weighted item from this table for each of 2 or 3 users. – dougalg Oct 15 '20 at 05:59
  • @dougalg Looks like an extra JOIN to me. – Branko Dimitrijevic Oct 15 '20 at 12:32
  • This really seems to work. I've looked at many of these, and this is the best so far! – JosephDoggie Feb 05 '21 at 14:08
11

ORDER BY random() ^ (1.0 / p)

from the algorithm described by Efraimidis and Spirakis.
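
Spelled out against the question's table, a sketch of the full query (this is my reading of the paper, not part of the original answer: each row gets the key random() ^ (1.0 / weight) and the row with the largest key wins) would be:

-- One random draw per row; the largest key wins, so order descending.
SELECT id
FROM test
ORDER BY random() ^ (1.0 / percent) DESC
LIMIT 1;

No cumulative sums or CTEs are needed; each row is scored independently.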

Mechanic Wei
4

Branko's accepted solution is great (thanks!). However, I'd like to contribute an alternative that is just as performant (according to my tests), and perhaps easier to visualize.

Let's recap. The original question can perhaps be generalized as follows:

Given a map of ids and relative weights, create a query that returns a random id in the map, but with a probability proportional to its relative weight.

Note the emphasis on relative weights, not percent. As Branko points out in his answer, using relative weights will work for anything, including percents.

Now, consider some test data, which we'll put in a temporary table:

CREATE TEMP TABLE test AS
SELECT * FROM (VALUES
    (1, 25),
    (2, 10),
    (3, 10),
    (4, 05)
) AS test(id, weight);

Note that I'm using a more complicated example than the one in the original question: the weights do not conveniently add up to 100, and the same weight (10) is used more than once (for ids 2 and 3), which is important to consider, as you'll see later.

The first thing we have to do is turn the weights into probabilities from 0 to 1, which is nothing more than a simple normalization (weight / sum(weights)):

WITH p AS ( -- probability
    SELECT *,
        weight::NUMERIC / sum(weight) OVER () AS probability
    FROM test
),
cp AS ( -- cumulative probability
    SELECT *,
        sum(p.probability) OVER (
            ORDER BY probability DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumprobability
    FROM p
)
SELECT
    cp.id,
    cp.weight,
    cp.probability,
    cp.cumprobability - cp.probability AS startprobability,
    cp.cumprobability AS endprobability
FROM cp
;

This will result in the following output:

 id | weight | probability | startprobability | endprobability
----+--------+-------------+------------------+----------------
  1 |     25 |         0.5 |              0.0 |            0.5
  2 |     10 |         0.2 |              0.5 |            0.7
  3 |     10 |         0.2 |              0.7 |            0.9
  4 |      5 |         0.1 |              0.9 |            1.0

The query above is admittedly doing more work than strictly necessary for our needs, but I find it helpful to visualize the relative probabilities this way, and it does make the final step of choosing the id trivial:

SELECT id FROM (query above)
WHERE random() BETWEEN startprobability AND endprobability;

(Note that in practice the random() value must be sampled just once, as the test below and the function at the end do; a volatile random() in the WHERE clause would be re-evaluated for every row.)

Now, let's put it all together with a test that ensures the query is returning data with the expected distribution. We'll use generate_series() to generate a random number a million times:

WITH p AS ( -- probability
    SELECT *,
        weight::NUMERIC / sum(weight) OVER () AS probability
    FROM test
),
cp AS ( -- cumulative probability
    SELECT *,
        sum(p.probability) OVER (
            ORDER BY probability DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumprobability
    FROM p
),
fp AS ( -- final probability
    SELECT
        cp.id,
        cp.weight,
        cp.probability,
        cp.cumprobability - cp.probability AS startprobability,
        cp.cumprobability AS endprobability
    FROM cp
)
SELECT fp.id, count(*) AS count
FROM fp
CROSS JOIN (SELECT random() FROM generate_series(1, 1000000)) AS random(val)
WHERE random.val BETWEEN fp.startprobability AND fp.endprobability
GROUP BY fp.id
ORDER BY count(*) DESC
;

This will result in output similar to the following:

 id | count  
----+--------
 1  | 499679 
 3  | 200652 
 2  | 199334 
 4  | 100335 

As you can see, the result tracks the expected distribution closely.

Performance

The query above is quite performant. Even on my average machine, with PostgreSQL running in a WSL1 instance (the horror!), execution is relatively fast:

     count | time (ms)
-----------+----------
     1,000 |         7
    10,000 |        25
   100,000 |       210
 1,000,000 |      1950 

Adaptation to generate test data

I often use a variation of the query above when generating test data for unit/integration tests. The idea is to generate random data that approximates a probability distribution that tracks reality.

In that situation I find it useful to compute the start and end probabilities once and store the results in a table:

CREATE TEMP TABLE test AS
WITH test(id, weight) AS (VALUES
    (1, 25),
    (2, 10),
    (3, 10),
    (4, 05)
),
p AS ( -- probability
    SELECT *, (weight::NUMERIC / sum(weight) OVER ()) AS probability
    FROM test
),
cp AS ( -- cumulative probability
    SELECT *,
        sum(p.probability) OVER (
            ORDER BY probability DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) cumprobability
    FROM p
)
SELECT
    cp.id,
    cp.weight,
    cp.probability,
    cp.cumprobability - cp.probability AS startprobability,
    cp.cumprobability AS endprobability
FROM cp
;

I can then use these precomputed probabilities repeatedly, which results in extra performance and simpler use.

I can even wrap it all in a function that I can call any time I want to get a random id:

CREATE OR REPLACE FUNCTION getrandomid(p_random FLOAT8 = random())
RETURNS INT AS
$$
    SELECT id
    FROM test
    WHERE p_random BETWEEN startprobability AND endprobability
    ;
$$
LANGUAGE SQL STABLE STRICT;
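
A usage sketch (my own, assuming the precomputed test table above): the default value of p_random is evaluated at call time, so every call without an argument gets its own random draw, while passing an explicit value makes the pick reproducible:

-- Ten independent draws.
SELECT getrandomid() FROM generate_series(1, 10);

-- 0.25 falls in id 1's bracket in the example data, so this always returns 1.
SELECT getrandomid(0.25);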

Window function frames

It's worth noting that the technique above uses a window function with a non-default frame, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This is necessary to deal with the fact that some weights might be repeated: with the default RANGE frame, rows with equal weights would be peers and would get the same running total. That's why I chose test data with repeated weights in the first place!
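
A quick illustration (my own sketch, run against the test table above): with the default RANGE frame the tied weights of ids 2 and 3 make those rows peers, so they receive the same running total, while the explicit ROWS frame gives each row its own cumulative value:

SELECT id, weight,
    sum(weight) OVER (ORDER BY weight DESC) AS range_frame,  -- default frame: peers share a value
    sum(weight) OVER (
        ORDER BY weight DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS rows_frame
FROM test
ORDER BY rows_frame;

In the example data, range_frame is 45 for both ids 2 and 3, while rows_frame is 35 and 45; the distinct values are what keep the probability brackets from overlapping.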

Demian Martinez
  • Hey, How could this be extended to pick N distinct entries (in this case from the test table)? Also, if possible I'd like to insert the picks immediately into another table. – Pirulax Sep 26 '22 at 17:44
2

Your proposed query appears to work; see this SQLFiddle demo. It creates the wrong distribution though; see below.

To prevent PostgreSQL from optimising the subquery I've wrapped it in a VOLATILE SQL function. PostgreSQL has no way to know that you intend the subquery to run once for every row of the outer query, so if you don't force it to be volatile it'll just execute it once. Another possibility - though one that the query planner might optimise away in future - is to make it appear to be a correlated subquery, like this hack that uses an always-true WHERE clause: http://sqlfiddle.com/#!12/3039b/9
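
A sketch of that wrapper (the function name is mine; the body is the question's original ORDER BY, so the distribution problem discussed below remains, but the subquery is now genuinely re-executed for every call):

CREATE OR REPLACE FUNCTION weighted_pick() RETURNS integer AS $$
    SELECT id FROM test ORDER BY percent * random() DESC LIMIT 1;
$$ LANGUAGE sql VOLATILE;

-- Because the function is VOLATILE, each outer row triggers a fresh execution.
SELECT weighted_pick() FROM generate_series(1, 5);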

At a guess (before you updated to explain why it didn't work) your testing methodology was at fault, or you were using this as a subquery in an outer query, where PostgreSQL notices it isn't a correlated subquery and executes it just once, as in this example.

UPDATE: The distribution produced isn't what you're expecting. The issue here is that you're skewing the distribution by taking multiple samples of random(); you need a single sample.

This query produces the correct distribution (SQLFiddle):

WITH random_weight(rw) AS (SELECT random() * (SELECT sum(percent) FROM test))
SELECT id
FROM (
  SELECT
    id,
    sum(percent) OVER (ORDER BY id),
    coalesce(sum(prev_percent) OVER (ORDER BY id), 0)
  FROM (
    SELECT
      id,
      percent,
      lag(percent) OVER () AS prev_percent
    FROM test
  ) x
) weighted_ids(id, weight_upper, weight_lower)
CROSS JOIN random_weight
WHERE rw BETWEEN weight_lower AND weight_upper;

Performance is, needless to say, horrible. It's using two nested sets of windows. What I'm doing is:

  • Creating (id, percent, previous_percent) then using that to create two running sums of weights that are used as range brackets; then
  • Taking a random value, scaling it to the range of weights, and then picking a value that has weights within the target bracket
Craig Ringer
  • looks to me like you proved that it's not working. 3 is coming in at 4% whereas it should be 15%. – digitaljoel Oct 23 '12 at 22:50
  • @digitaljoel Good point. I was presuming that their helpful "not working" was an issue with uncorrelated subquery optimisation producing the same result across a set, not an unexpected distribution. Hmm. *tries to dig up old probability lectures in brain*. – Craig Ringer Oct 23 '12 at 22:56
  • @digitaljoel Got it; the trouble was multi-sampling of the random number. – Craig Ringer Oct 23 '12 at 23:43
1

Here is something for you to play with:

select t1.id as id1
  , case when t2.id is null then 0 else t2.id end as id2
  , t1.percent as percent1
  , case when t2.percent is null then 0 else t2.percent end as percent2 
from "Test1" t1 
  left outer join "Test1" t2 on t1.id = t2.id + 1
where random() * 100 between t1.percent and 
  case when t2.percent is null then 0 else t2.percent end;

Essentially, perform a left outer join so that you have two columns to which you can apply a BETWEEN clause.

Note that it will only work if you get your table ordered in the right way.

digitaljoel
Darren
  • You know it occurred to me that if you include a "sacrificial" row (0,0) in your table then you could simply do an inner join instead, and remove the pesky case statements. It would simplify the query a great deal. – Darren Oct 23 '12 at 23:43
1

Based on Branko Dimitrijevic's answer, I wrote this query, which may or may not be faster; it uses the sum total of percent via tiered window functions (not unlike a ROLLUP).

WITH random AS (SELECT random() AS random)
SELECT id FROM (
    SELECT id, percent,
    SUM(percent) OVER (ORDER BY id) AS rank,
    SUM(percent) OVER () * random AS roll
    FROM test CROSS JOIN random
) t WHERE roll <= rank LIMIT 1

If the ordering isn't important, SUM(percent) OVER (ROWS UNBOUNDED PRECEDING) AS rank may be preferable because it avoids having to sort the data first.

I also tried Mechanic Wei's answer (as described in this paper, apparently), which seems very promising in terms of performance, but after some testing, the distribution appears to be off:

SELECT id
FROM test
ORDER BY random() ^ (1.0/percent)
LIMIT 1
Santo Guevarra
0

From this paper, note that we have to compute random() ^ (-1.0 / p) (note the minus sign).

ORDER BY RANDOM() ^ ( -1.0 / p )

The SQLFiddle example will give you:

id  percent  freq
1   40       0.39795 
2   30       0.29540 
3   20       0.20635
4   10       0.10030

Full Code


Schema

CREATE TABLE test
    (id integer, percent integer);

INSERT INTO test
    (id, percent)
VALUES
    (1, 40),
    (2, 30),
    (3, 20),
    (4, 10)
;

CREATE OR REPLACE FUNCTION get_random_row() RETURNS integer AS $SQL$
    SELECT id
    FROM test
    ORDER BY RANDOM() ^ ( -1.0 / percent )
    LIMIT 1
$SQL$ LANGUAGE sql VOLATILE;

Query

SELECT id, count(id)/10000.0 AS freq
FROM (
  SELECT get_random_row()
  FROM generate_series(1,10000)
) x(id)
GROUP BY id
ORDER BY 2;
Stefan Falk