27

How do you randomly select a table row in T-SQL based on an applied weight for all candidate rows?

For example, I have a set of rows in a table weighted at 50, 25, and 25 (which adds up to 100 but does not need to), and I want to select one of them randomly with a statistical outcome equivalent to the respective weight.

ConcernedOfTunbridgeWells
Dane
  • I think right now "Shiroy's" answer, the very last one as of 02/12/2021 is the best.... -- key line from that post by Shiroy is: ORDER BY POWER(RAND(CAST(NEWID() AS VARBINARY)), (1.0/Weight)) DESC – JosephDoggie Feb 12 '21 at 16:18

5 Answers

19

Dane's answer includes a self join in a way that introduces a square law: the join produces roughly (n*n)/2 rows when the table has n rows.

It would be better to be able to scan the table just once.

DECLARE @id int, @weight_sum int, @weight_point int
DECLARE @table TABLE (id int, weight int)

INSERT INTO @table(id, weight) VALUES(1, 50)
INSERT INTO @table(id, weight) VALUES(2, 25)
INSERT INTO @table(id, weight) VALUES(3, 25)

SELECT @weight_sum = SUM(weight)
FROM @table

SELECT @weight_point = FLOOR(((@weight_sum - 1) * RAND() + 1))

SELECT
    @id = CASE WHEN @weight_point < 0 THEN @id ELSE [table].id END,
    @weight_point = @weight_point - [table].weight
FROM
    @table [table]
ORDER BY
    [table].Weight DESC
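
-- Return the chosen id (a final SELECT like this is what the comments below suggest adding)
SELECT @id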

This will go through the table, setting @id to each record's id value while at the same time decrementing @weight_point. Eventually @weight_point goes negative, which means that the sum of all the preceding weights is greater than the randomly chosen target value. That is the record we want, so from that point onwards @id is just set to itself (ignoring the ids of any remaining rows).

This runs through the table only once, but it does have to run through the entire table even if the chosen value is the first record. Because the average stopping position is halfway through the table (and earlier when the rows are ordered by descending weight, as here), writing a loop could be faster, especially if the weights fall into a few common groups:

DECLARE @id int, @weight_sum int, @weight_point int, @next_weight int, @row_count int
DECLARE @table TABLE (id int, weight int)

INSERT INTO @table(id, weight) VALUES(1, 50)
INSERT INTO @table(id, weight) VALUES(2, 25)
INSERT INTO @table(id, weight) VALUES(3, 25)

SELECT @weight_sum = SUM(weight)
FROM @table

SELECT @weight_point = FLOOR(((@weight_sum - 1) * RAND() + 1))

SELECT @next_weight = MAX(weight) FROM @table
SELECT @row_count   = COUNT(*)    FROM @table WHERE weight = @next_weight
SET @weight_point = @weight_point - (@next_weight * @row_count)

WHILE (@weight_point > 0)
BEGIN
    SELECT @next_weight = MAX(weight) FROM @table WHERE weight < @next_weight
    SELECT @row_count   = COUNT(*)    FROM @table WHERE weight = @next_weight
    SET @weight_point = @weight_point - (@next_weight * @row_count)
END

-- Once @weight_point is no longer positive, we know that the randomly chosen record
-- is in the group of records WHERE [table].weight = @next_weight

SELECT @row_count = FLOOR(((@row_count - 1) * RAND() + 1))

SELECT
    @id = CASE WHEN @row_count < 0 THEN @id ELSE [table].id END,
    @row_count = @row_count - 1
FROM
    @table [table]
WHERE
    [table].weight = @next_weight
ORDER BY
    [table].Weight DESC
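
-- As above, return the chosen id with a final SELECT
SELECT @id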
MatBailie
  • I did some empirical tests and found that your solution is very sensitive to the input data. My testing data - weights: 2, 998; iterations: 1M. Weight 2 should be picked about 2k times. If the order of the weights in the table is ascending (2, 998), it picks up weight 2 only about 500 times. If you invert the order, it's about 2500 times. If you change `ROUND` to `FLOOR`, for ascending order it picks up weight 2 about 1000 times, for descending about 2000 times - and that's the proper probability. I have updated your answer. – David Ferenczy Rogožan Sep 01 '16 at 19:01
  • I'm just not sure why `FLOOR` works better than `ROUND`. With `ROUND`, it picks up the small weight too few times (1/4 of the expected count) with the ascending order, or too many times with the descending order. `FLOOR` also picks up the small weight too few times (1/2 of the expected count) with the ascending order, but with almost ideal probability when the weights are sorted in descending order. – David Ferenczy Rogožan Sep 01 '16 at 19:12
  • Am I going mad, or shouldn't the first `SELECT @row_count = COUNT(*) FROM @table` have a `WHERE weight = @next_weight` appended to it? Otherwise @weight_point will always be 0 or less going into the loop check, and so the top value will always be selected. – oflahero Dec 02 '16 at 00:07
  • There's a minor issue with FLOOR, as the function requires 1 argument. – Dan Dec 22 '20 at 10:49
  • @Dan - That's what comes of other people editing answers. Good spot, I'll fix their edit now. – MatBailie Dec 22 '20 at 10:52
  • @JosephDoggie - "It doesn't work" is the singularly most unhelpful phrase on StackOverflow. To help we need to know the code you ran, then error you got, etc, etc. The best advice I can offer is to open a new question, giving an example of what you're trying to do, your code and error messages, etc, then include a link to this question for people to refer to. *(This is also for a very old version of SQL Server, if you do open a new question, include which version you're using and you may get shorter and better code.)* – MatBailie Feb 04 '21 at 20:58
  • If copied into TSQL, there are errors, such as on the RAND function etc – JosephDoggie Feb 04 '21 at 21:51
  • It appears to work now, however for me (I use dbVisualizer) it isn't outputting any useful results to the command window. I have other priorities, not sure why. It is known that dbVisualizer doesn't "play well" with TSQL. Thanks for fixing – JosephDoggie Feb 08 '21 at 20:07
  • @JosephDoggie: what problem do you run into with DbVisualizer? I am sure we can sort this out, just open Help->Contact Support and post your support ticket. – roger Feb 11 '21 at 08:15
  • Still, for TSQL it seems very hard to declare simple variables or stored-proc parameters in DbVisualizer. I have noted this, as has a colleague. Removing extraneous ";" characters seems to help. Unfortunately a CTE requires ";" so it is a known problem. If you work there at DbVisualizer, perhaps you can bring it to people's attention -- I also did open my own support ticket. – JosephDoggie Feb 11 '21 at 16:15
  • For db visualizer, add: SELECT @id as ourRandNumber at very end -- I tried to edit this in as well – JosephDoggie Feb 11 '21 at 16:30
  • In my testing, with my proposed edit, 3 comes up more frequently than 1 and 2 never comes up at all. Am I missing something? [PROPOSED EDIT IS simple add to select output -- SELECT @id as ourRandNumber] That Select goes at the very end of the SQL-code – JosephDoggie Feb 12 '21 at 13:52
8

You simply need to sum the weights of all candidate rows, then choose a random point within that sum, then select the record that corresponds to that chosen point (each record incrementally carries an accumulating weight sum with it).

DECLARE @id int, @weight_sum int, @weight_point int
DECLARE @table TABLE (id int, weight int)

INSERT INTO @table(id, weight) VALUES(1, 50)
INSERT INTO @table(id, weight) VALUES(2, 25)
INSERT INTO @table(id, weight) VALUES(3, 25)

SELECT @weight_sum = SUM(weight)
FROM @table

SELECT @weight_point = ROUND(((@weight_sum - 1) * RAND() + 1), 0)

SELECT TOP 1 @id = t1.id
FROM @table t1, @table t2
WHERE t1.id >= t2.id
GROUP BY t1.id
HAVING SUM(t2.weight) >= @weight_point
ORDER BY t1.id

SELECT @id
Dane
  • I did a small benchmark of your and MatBailie's solutions and it looks like yours takes about twice as much time as Mat's. On a table with 2 rows and 1 million iterations, Mat's solution took about 45 seconds and your solution about 85 seconds. – David Ferenczy Rogožan Sep 01 '16 at 18:40
4

The "incrementally carrying a an accumlating[sic] weight sum" part is expensive if you have a lot of records. If you also already have a wide range of scores/weights (ie: the range is wide enough that most records weights are unique. 1-5 stars probably wouldn't cut it), you can do something like this to pick a weight value. I'm using VB.Net here to demonstrate, but this could easily be done in pure Sql as well:

Function PickScore() As Double
    'Assume we have a database wrapper class instance called SQL and a seeded Random instance called Rng
    'Get count of scores in database
    Dim ScoreCount As Double = SQL.ExecuteScalar("SELECT COUNT(score) FROM [MyTable]")
    ' You could also approximate this with just the number of records in the table, which might be faster.

    'Random number between 0 and 1 with ScoreCount possible values
    Dim rand As Double = Rng.Next(CInt(ScoreCount)) / ScoreCount

    'Use the equation y = 1 - x^3 to skew results in favor of higher scores
    ' For x between 0 and 1, y is also between 0 and 1 with a strong bias towards 1
    rand = 1 - (rand * rand * rand)

    'Now map the (0, 1] value onto (0, MaxScore]:
    'just find MaxScore and multiply by rand
    Dim MaxScore As UInteger = SQL.ExecuteScalar("SELECT MAX(Score) FROM Songs")
    Return MaxScore * rand
End Function

Run this, and pick the record with the largest score less than the returned weight. If more than one record shares that score, pick one of them at random. The advantages here are that you don't have to maintain any sums, and you can tweak the probability equation used to suit your tastes. But again, it works best with a larger distribution of scores.
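
For reference, here is a rough sketch of the same idea in pure T-SQL. The table MyTable and column Score are placeholders, and the skew curve is the same y = 1 - x^3 used above:

DECLARE @max_score int, @rand float, @target float

-- Uniform random in [0, 1), then skew towards 1 with y = 1 - x^3
SET @rand = RAND()
SET @rand = 1 - (@rand * @rand * @rand)

-- Map the skewed value onto (0, MaxScore]
SELECT @max_score = MAX(Score) FROM MyTable
SET @target = @max_score * @rand

-- Largest score not exceeding the target; ties broken at random
SELECT TOP 1 *
FROM MyTable
WHERE Score <= @target
ORDER BY Score DESC, NEWID()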

Joel Coehoorn
3

If you need to get a group of samples (say, you want to sample 50 rows from a collection of 5M rows) where each row has an int column called Weight, and larger values mean more weight, you can use this query:

SELECT * 
FROM 
(
    SELECT TOP 50 RowData, Weight 
    FROM MyTable 
    ORDER BY POWER(RAND(CAST(NEWID() AS VARBINARY)), (1.0/Weight)) DESC
) X 
ORDER BY Weight DESC

The key here is using the POWER() function, as illustrated here.

A reference on the choice of a random function is here and here

Alternatively you can use:

1.0 * ABS(CAST(CHECKSUM(NEWID()) AS bigint)) / CAST(0x7FFFFFFF AS INT) 

You cast the CHECKSUM result to BIGINT instead of INT because of this issue:

Because checksum returns an int, and the range of an int is -2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647), the abs() function can return an overflow error if the result happens to be exactly -2,147,483,648! The chances are obviously very low, about 1 in 4 billion, however we were running it over a ~1.8b row table every day, so it was happening about once a week! Fix is to cast the checksum to bigint before the abs.
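
Putting the two together, the weighted sample above could be written with the alternative random expression like this (a sketch only, using the same hypothetical MyTable, RowData and Weight columns):

SELECT TOP 50 RowData, Weight
FROM MyTable
ORDER BY POWER(1.0 * ABS(CAST(CHECKSUM(NEWID()) AS bigint)) / CAST(0x7FFFFFFF AS int), 1.0 / Weight) DESC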

Shiroy
  • This seems to be the *BEST* option and also avoids a lot of complicated joins, looping, etc. Well done! – JosephDoggie Feb 08 '21 at 15:31
  • As noted elsewhere on the web, a colleague pointed out that the following is considered more 'numerically stable': ORDER BY -(1.0/Weight) * LOG(RAND(CAST(NEWID() AS VARBINARY))) ASC – JosephDoggie Feb 08 '21 at 15:32
3

The way to do this with random number generators is to integrate the probability density function. With a set of discrete values you can calculate the prefix sum (the sum of all values up to and including this one) and store it. With this you select the minimum prefix sum (aggregate to date) value greater than the random number.

On a database, the prefix sums of all subsequent rows have to be updated after an insertion. If the relative frequency of updates and the size of the data set don't make the cost of doing this prohibitive, the appropriate value can be obtained from a single sargable query (one whose predicate can be resolved by an index lookup).
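
A minimal sketch of that approach (the weighted_items table with id, weight and a persisted, indexed cumulative_weight column is hypothetical; the prefix sums of later rows must be recalculated whenever a weight changes):

DECLARE @total int, @point int

-- The total weight is the largest prefix sum
SELECT @total = MAX(cumulative_weight) FROM weighted_items

-- Random point in [1, @total]
SET @point = 1 + CAST(FLOOR(RAND() * @total) AS int)

-- Sargable lookup: the smallest prefix sum that covers the random point
SELECT TOP 1 id
FROM weighted_items
WHERE cumulative_weight >= @point
ORDER BY cumulative_weight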

ConcernedOfTunbridgeWells