Let me try to think aloud.
The best approach should deterministically select a specific entry for each value in the range (0, 1). The value zero should be reserved for entries with zero weight, so that they are never selected.
For the first point to be possible, each entry should know its lower and upper limits within the range between zero and one. For example, with weights 0.2 and 0.8, the first entry would own the sub-range (0, 0.2] and the second the sub-range (0.2, 1]. That is exactly what you describe in your possible solution. You also say that this implies a lot of work on every entry weight change. But in any case you need to either readjust the weights so that they sum up to 1, or update and cache somewhere the sum of all weights, so that you can compute the percentage of each weight. Hm, and you are right: if you readjust a weight between the first and the last entry, you will have to update the limits of every entry in between.
This leads us to a solution where you compute the limits (the lower or the upper; you actually need only one of them) at run time, by summing the weights up to the current entry. The good news is that you can use a covering index on weight, that is, read the weight values directly from the index in memory.
The disadvantage of this approach is that the execution time of the query depends on the random value: the larger the value, the longer the query takes. But the ranges of the bigger weights are matched more often, so we might take advantage of the fact that the weights are ordered in the BTREE index: force the lookup to start from the end of the index and short-circuit the processing as soon as the value is found (I am not sure yet, though, that this is possible for a query that accumulates a value).
I should think a bit longer.
Upd1. I have just realized that I described exactly the same thing as the linked answer. Given that the query takes the weights from an index, the solution should be fast enough. There is, certainly, an even faster solution for the selection itself, but it requires more space and preparation. This might seem crazy, but there are cases where it can work. You can describe your weight distribution as a fixed range of int values and maintain a (memory) table that maps every int value from that range to a specific weighted entry. The query then rounds a random value to an int, and the int value (being the primary key of the memory table) points to some entry id. Obviously, the number of rows in the table depends on the granularity of the weights, and you will have to rebuild the whole table after any weight update. But when weight updates are rare, this might be an option.
Upd2. I decided to show some SQL. Below are two working solutions. Say, we have a table:
CREATE TABLE entries (
  entry_id int(10) unsigned NOT NULL AUTO_INCREMENT,
  weight float NOT NULL DEFAULT 0.,
  data varchar(50),
  PRIMARY KEY (entry_id) USING BTREE,
  KEY weights (weight) USING BTREE
) ENGINE=InnoDB;
INSERT INTO entries (weight) VALUES (0.), (0.3), (0.1), (0.3), (0.0), (0.2), (0.1);
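The queries below count on the weights summing up to 1 (otherwise you would divide each weight by the cached total, as discussed above). A quick sanity check:

-- expect 1, up to FLOAT rounding (see the DECIMAL note at the end)
SELECT SUM(weight) FROM entries;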
The best query we can think of has a ready mapping from a rand() value to a specific entry_id. In this case all we need is to find an entry by its primary key. As I said, the table for such a query takes some space, but suppose we are ready for that. We may want to keep the mapping in memory, so we can use the MEMORY engine, which uses a HASH index for the primary key (which is good as well, since we want to map a value to one specific value).
Let's look at our table:
mysql> SELECT entry_id, weight FROM entries ORDER BY weight;
+----------+--------+
| entry_id | weight |
+----------+--------+
|        1 |      0 |
|        5 |      0 |
|        3 |    0.1 |
|        7 |    0.1 |
|        6 |    0.2 |
|        2 |    0.3 |
|        4 |    0.3 |
+----------+--------+
Let's create another table and fill it with values:
CREATE TABLE int2entry (
  an_int int(10) unsigned NOT NULL AUTO_INCREMENT,
  entry_id int(10) unsigned NOT NULL,
  PRIMARY KEY (an_int)
) ENGINE=Memory;
TRUNCATE int2entry;
INSERT INTO int2entry (entry_id)
VALUES (3), (7), (6), (6), (2), (2), (2), (4), (4), (4);
The idea is that the number of references to a specific entry_id is proportional to its weight. The table may be hard to update using SQL only (a rebuild sketch follows the lookup query below), and you have to truncate and refill it after each weight change, but, as I said, this can still be an option when updates are rare. Here is the query to get the entry_id, which you can then join to the entries table (note that you should know the number of rows in the mapping table):
SELECT entry_id
FROM (SELECT ceiling(rand() * 10) as an_int) as t1
JOIN int2entry USING (an_int);
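Refilling the mapping is easier with a helper table of integers. Here is a rough sketch; the ints(n) table holding the integers from 1 to 10 is a hypothetical helper, and the weights are assumed to have a granularity of 0.1, so that each entry needs ROUND(weight * 10) rows:

-- rebuild the mapping after a weight change
-- (ints(n) is a hypothetical helper table with the integers 1..10)
TRUNCATE int2entry;
INSERT INTO int2entry (entry_id)
SELECT e.entry_id
FROM entries AS e
JOIN ints AS i ON i.n <= ROUND(e.weight * 10)  -- ROUND(weight * 10) rows per entry
ORDER BY e.entry_id;                           -- AUTO_INCREMENT numbers the rows 1..10

Entries with zero weight produce no rows, so they can never be selected.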
Another solution is to use cumulative weights and take advantage of the order in the index.
When we select data, it is read in some index order (for SELECT *, in the primary key order). The weights index is an ordered mapping from weights to entry_ids. If we select only weights and entry_ids, the values can be taken directly from the weights index, that is, the data will be read in the index order. We can use ORDER BY to force iteration in the reversed index order (the larger weights are stored at the end, but they are going to be matched more frequently). Why is this important? Because we are going to add some hacky magic to the WHERE clause and count on a specific order of processing the rows:
SET @rand:= RAND(), @cum_weight:=0.;
SELECT entry_id, weight, @cum_weight, @rand
FROM entries
WHERE @rand < @cum_weight:=@cum_weight+weight
ORDER BY weight DESC
LIMIT 1;
+----------+--------+----------------------------------+--------------------+
| entry_id | weight | @cum_weight                      | @rand              |
+----------+--------+----------------------------------+--------------------+
|        6 |    0.2 | 0.800000026822090100000000000000 | 0.6957228003961247 |
+----------+--------+----------------------------------+--------------------+
In practice, you need only the entry_id, that is, the resulting query should be something like:
SELECT *
FROM entries
JOIN (
  SELECT entry_id
  FROM entries
  JOIN (SELECT @rand:= RAND(), @cum_weight:=0.) as init
  WHERE @rand < @cum_weight:=@cum_weight+weight
  ORDER BY weight DESC
  LIMIT 1
) as rand_entry USING (entry_id);
Mind the LIMIT 1, which stops the processing as soon as the required entry is found.
Besides,
- you should probably use the DECIMAL type instead of FLOAT or DOUBLE to store the weights, to avoid undesired rounding errors on a large set of small weights;
- you can hide the funny query inside a stored function and use a cursor instead; the syntax might seem more comprehensible (see the sketches after this list).
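For the first point, a possible column change (the precision here is arbitrary, pick one matching your weight granularity):

ALTER TABLE entries MODIFY weight decimal(5,4) NOT NULL DEFAULT 0;

For the second point, here is a minimal sketch of a cursor-based stored function; the function name and the handling of the not-found case are my assumptions, not a tested recipe:

DELIMITER //
CREATE FUNCTION pick_weighted_entry() RETURNS int unsigned
  READS SQL DATA NOT DETERMINISTIC
BEGIN
  DECLARE result int unsigned;
  DECLARE w float;
  DECLARE cum float DEFAULT 0;
  DECLARE r float DEFAULT RAND();
  DECLARE done int DEFAULT 0;
  -- iterate from the largest weight down, same order as in the query above
  DECLARE cur CURSOR FOR
    SELECT entry_id, weight FROM entries ORDER BY weight DESC;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  OPEN cur;
  accum: LOOP
    FETCH cur INTO result, w;
    IF done THEN LEAVE accum; END IF;    -- weights summed up to less than r
    SET cum = cum + w;
    IF r < cum THEN LEAVE accum; END IF; -- short-circuit: entry found
  END LOOP accum;
  CLOSE cur;
  RETURN result;
END//
DELIMITER ;

-- usage:
SELECT * FROM entries WHERE entry_id = pick_weighted_entry();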