I have tried van's solution and, although it works, it is not quick.
My Solution
The way that I am solving this problem is by maintaining a separate, linked table for the weighting. The basic table structure is similar to this:
CREATE TABLE `table1` (
  `id` int(11) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  `name` varchar(100),
  `weight` tinyint(4) NOT NULL DEFAULT '1'
);
CREATE TABLE `table1_weight` (
  `id` bigint(20) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  `table1_id` int(11) NOT NULL
);
If I have a record in `table1` with a weight of 3, then I create 3 records in `table1_weight`, linked to `table1` via the `table1_id` field. Whatever the value of `weight` is in `table1`, that's how many linked records I create in `table1_weight`.
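To illustrate, here is one way to (re)build `table1_weight` from `table1` in a single statement. This is a sketch, not part of my original setup: it assumes MySQL 8+ (for `WITH RECURSIVE` as a numbers generator); on older versions a pre-filled numbers table works the same way.

```sql
-- Rebuild the weight table so each row in table1 appears `weight` times.
-- The recursive CTE generates the numbers 1..255 (the tinyint maximum),
-- and the join keeps one copy per unit of weight.
TRUNCATE TABLE table1_weight;

INSERT INTO table1_weight (table1_id)
WITH RECURSIVE seq (n) AS (
  SELECT 1
  UNION ALL
  SELECT n + 1 FROM seq WHERE n < 255
)
SELECT t.id
FROM table1 t
JOIN seq s ON s.n <= t.weight;
```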
Testing
On a dataset with 976 records in `table1` and a total weight of 2031, and hence 2031 records in `table1_weight`, I ran the following two SQLs:
A version of van's solution
SELECT t.*
FROM table1 t
INNER JOIN
( SELECT t.id,
SUM(tt.weight) AS cum_weight
FROM table1 t
INNER JOIN table1 tt ON tt.id <= t.id
GROUP BY t.id) tc ON tc.id = t.id,
( SELECT SUM(weight) AS total_weight
FROM table1) tt,
( SELECT RAND() AS rnd) r
WHERE r.rnd * tt.total_weight <= tc.cum_weight
ORDER BY t.id ASC
LIMIT 1
Joining to a secondary table for the weighting
SELECT t.*
FROM table1 t
INNER JOIN table1_weight w
ON w.table1_id = t.id
ORDER BY RAND()
LIMIT 1
SQL 1 consistently takes 0.4 seconds.
SQL 2 takes between 0.01 and 0.02 seconds.
Conclusion
If the speed of selecting a random, weighted record is not an issue, then the single-table SQL suggested by van is fine and avoids the overhead of maintaining a separate table.
If, as in my case, a short selection time is critical, then I would recommend the two-table method.
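That maintenance overhead can be automated with triggers. The following is a hedged sketch (trigger names are illustrative, MySQL syntax assumed) that keeps `table1_weight` in step whenever `table1` changes:

```sql
DELIMITER //

-- On insert, add NEW.weight rows to the weight table.
CREATE TRIGGER table1_ai AFTER INSERT ON table1
FOR EACH ROW
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < NEW.weight DO
    INSERT INTO table1_weight (table1_id) VALUES (NEW.id);
    SET i = i + 1;
  END WHILE;
END//

-- On a weight change, rebuild that record's rows.
CREATE TRIGGER table1_au AFTER UPDATE ON table1
FOR EACH ROW
BEGIN
  DECLARE i INT DEFAULT 0;
  IF NEW.weight <> OLD.weight THEN
    DELETE FROM table1_weight WHERE table1_id = NEW.id;
    WHILE i < NEW.weight DO
      INSERT INTO table1_weight (table1_id) VALUES (NEW.id);
      SET i = i + 1;
    END WHILE;
  END IF;
END//

DELIMITER ;
```

A matching `AFTER DELETE` trigger (or an `ON DELETE CASCADE` foreign key on `table1_id`) would complete the picture.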