Select TOP 1 from records with weight taken into account

Question

I already saw this post here (http://stackoverflow.com/questions/1398113/sql-select-one-row-randomly-but-taking-into-account-a-weight), but couldnt work it out. Where do I put the 'Stuff' table? Why don't they use NEWID() instead of RND()?

Table Stuff

id     item       weight       location
1      ball       1            Wyoming
2      cup        2            Alaska
3      sock       1            Idaho
4      car        3            Miami
5      hot girl   5            Brazil

Now according to that article referenced above I'm supposed to do this

SELECT      TOP 1 t.*
FROM        @Table t
INNER JOIN (SELECT t.id, sum(tt.weight) AS cum_weight
            FROM        @Table t
            INNER JOIN  @Table tt ON  tt.id <= t.id
            GROUP BY    t.id) tc
        ON  tc.id = t.id,
           (SELECT  SUM(weight) AS total_weight FROM @Table) tt,
           (SELECT  RAND() AS rnd) r
WHERE       r.rnd * tt.total_weight <= tc.cum_weight
ORDER BY    t.id ASC

I want to do the above, but in this fashion:

SELECT TOP (1) from stuff WHERE blahblahblah AND (location='Brazil' OR location='Wyoming' OR location='Brazil') AND (weight <= cum_weight) ORDER BY NEWID()

I'm only guessing I can use NEWID() and not compelled to use RND()

Can you explain in English what you are trying to do? Your queries success taking a group of records whose cumulative sum is less than a given weight and ordering them randomly. That does not seem useful. — Gordon Linoff, May 20 '12 at 03:27
I'm trying to select 1 random row per query with weight taken into account. A rowset with a higher weight should appear more often than a row with a lesser weight. A weight of 5 should be 5 times more likely to apear than a weight of say, 1. If more than one row fulfills that scenario (rows with same weight), then select one rowset from that group. Something quick and dirty would be OK, I dont need it to be statistically perfect — Patriotec, May 20 '12 at 03:34
Please don't prefix your titles with "MSSQL" and such. That's what the tags are for. — John Saunders, May 20 '12 at 04:52

Gordon Linoff · Accepted Answer · 2012-05-20T04:28:48.143

0

You could accomplish this by sampling from the cumulative sum, instead of from the records. The idea is to take a cumulative sum of the weights, then to take a random value up to the maximum weight, and finally to see which record has a cumulative sum around that random value. The SQL looks like:

select top 1 t.*
from (select t.*, cumulative_sum(weight) as cumweight,
             sum(weight) over (partition by NULL) as totalweight
      from t
     ) t
where rand()*(totalweight+1) < cumweight
order by cumweight desc

What this does is create a cumulative weight and then create a random variable up to the total sum of the weights. The last record whose cum weight is less than the sumweight is chosen. The "+1" is just to be sure that any record could be chosen, even the last.

In SQL Server 2012, you can calculate the cumulative SUM using SUM() over (partition by NULL order by ).

In SQL Server 2012, you might use: select top 1 t.* from (select t., sum(weight) over (partition by NULL order by weight) as cumweight, sum(weight) over (partition by NULL) as totalweight from t ) t where rand()(totalweight+1) < cumweight order by cumweight desc

Unfortunately, this syntax is not support in SQL Server 2008. In that database, you need to do a self-join, which is the query that you pulled from the original article.

edited May 20 '12 at 04:28

answered May 20 '12 at 03:50

Gordon Linoff

1,242,037
58
646
786

cumulative_sum(weight) as cumweight, sum(weight) over (partition by NULL) as totalweight from t. the cumulative_sum. It's not a function. How do I calculate the cumulative sum for weight? – Patriotec May 20 '12 at 04:14
Sorry, yes I am using SQL 2012 for this. When I try to run this, I get an error 'cumulative_sum' is not a recognized built-in function name – Patriotec May 20 '12 at 04:23
The cumulaitve_sum() was a conceptual idea. The correct syntax is SUM(var) over (partition by NULL order by ). – Gordon Linoff May 20 '12 at 04:27
You lost me, I rewrote the code and I'm stuck on that one point. you reference cumweight in the order by portion. What do I actually put in the place of 'cumulative_sum(weight) as cumweight'? – Patriotec May 20 '12 at 04:32
Thanks for the help. I ran a few queries and it looks good so far. Now I'll go create a loop and insert into a temp table to see how the distribution is. – Patriotec May 20 '12 at 05:23
I made a loop and inserted into a temp table 2000 times. FYI, not a good distribution with large records (or for that matter small sets of data either). That query keeps selecting the same rowsets over and over. Some rows were never selected. Not much of a spread at all. – Patriotec May 20 '12 at 16:01

Select TOP 1 from records with weight taken into account

1 Answers1