
I've seen plenty of posts advocating for something along the lines of SELECT * FROM tbl ORDER BY -LOG(RAND())/weights LIMIT 1; to select a random entry from a table. However, this seems terribly inefficient to me, since we have to scan the entire table and generate a random number for every row before sorting. This answer seems more like a step in the right direction: assuming that the overall sum is calculated beforehand, now we've got ourselves a simple linear search.

Still, it's definitely possible to do better, but the methods that come to mind all seem to require sketchy manoeuvres.

  1. We could store a cumulative weight distribution, generate a random number between 0 and the total weight (the maximum cumulative value), and use an index on the cumulative weights coupled with BETWEEN to find the matching entry (see the sketch after this list). However, deleting or moving an entry in the middle requires a lot of work updating the cumulative weights of everything after it.
  2. We could fragment the table into sqrt(n) smaller tables and store the sum of the weights within each. We would first search through these ranges until we find the one containing our selected random number, then perform a linear search on that table. However, having so many tables for large n seems like bad database design, and ideally I'd want to get it down to logarithmic time instead of O(sqrt(n)).
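
Roughly, what I have in mind for idea 1 (a minimal sketch, assuming a hypothetical cum_weight column holding the running total of the weights up to and including each row, strictly positive weights, and an index on cum_weight):

SET @r := RAND() * (SELECT MAX(cum_weight) FROM tbl);

-- equivalent to "@r BETWEEN the previous row's cum_weight AND cum_weight",
-- but phrased so the index on cum_weight can answer it with one range seek
SELECT *
FROM tbl
WHERE cum_weight > @r
ORDER BY cum_weight
LIMIT 1;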

I've been struggling with this problem for a while now, and this is the best I've come up with. Any other ideas?

concat
  • Check out my answer: http://stackoverflow.com/a/41577458/893432 – Ali Jan 10 '17 at 22:00
  • @Ali I've run your solution on a ~7MB InnoDB table, and haven't found a significant difference between the indexed query and the unindexed, "usual" query. I'm still convinced that the query is being executed as described above. – concat Jan 10 '17 at 22:59
  • @Ali Your query is certainly faster when there are unindexed columns, but fundamentally it is still `Ω(n)`. – concat Jan 11 '17 at 00:09

2 Answers


Let me try to think aloud.

  1. The best approach should deterministically select a specific entry for each value in the range (0, 1). The value zero should be reserved for entries with zero weight.

  2. For item 1 to be possible, each entry needs to know its lower and upper limits within the range between zero and one. That is exactly what you describe in your first possible solution. You also say that this implies a lot of work whenever an entry's weight changes. But either way you need to either readjust the weights so that they sum up to 1, or cache the sum of all weights somewhere so that you can compute each weight's share. Hm, you are right: if you change a weight anywhere other than at the very end, you will have to update the limits of every entry after it.

  3. This leads us to a solution where you compute the limits (you actually need only one of them, the lower or the upper) at run time by summing the weights up to the entry. The good news is that you can use a covering index on weight, that is, read the weight values straight from the index.

  4. The disadvantage of this approach is that the execution time of the query depends on the random value: the larger the value, the longer the query takes. But the ranges for the bigger weights will be matched more often, so we might somehow take advantage of the fact that the weights are ordered in the BTREE index: force the index lookup from the end and short-circuit the processing once the value is found (not sure yet, though, that this is possible for a query accumulating a value).

I should think a bit longer.

Upd1. I have just realized that I described exactly what is written in the linked answer. Given that the query will take the weights from an index, that solution should be fast enough. There is certainly an even faster solution for the selection itself, but it requires more space and preparation. This might seem crazy, but there are cases where it might work: you can describe your weight distribution as a fixed range of integer values, and maintain a (memory) table mapping every integer in that range to a specific weighted entry. The query then rounds a random value to an int, and that int (being the primary key of the memory table) points to some entry id. Obviously, the number of rows in the mapping table depends on the granularity of the weights, and you will have to rebuild the whole table after any weight update. But if weight updates are rare, this might be an option.

Upd2. I decided to show some SQL. Below are two working solutions. Say we have a table:

CREATE TABLE entries (
  entry_id int(10) unsigned NOT NULL AUTO_INCREMENT,
  weight float NOT NULL DEFAULT 0.,
  data varchar(50),
  PRIMARY KEY (entry_id) USING BTREE,
  KEY weights (weight) USING BTREE
) ENGINE=InnoDB;

INSERT INTO entries (weight) VALUES (0.), (0.3), (0.1), (0.3), (0.0), (0.2), (0.1);

The best query we can think of would have a ready mapping from a rand() value to a specific entry_id. In that case all we need is to find an entry by its primary key. As I said, the table for such a query will take some space, but suppose we are ready for that. We might want to keep the mapping in memory, so we can use the MEMORY engine, which uses a HASH index as the primary key (which is good as well, since we want to map a value to a specific entry).

Let's look at our table:

mysql> SELECT entry_id, weight FROM entries ORDER BY weight;
+----------+--------+
| entry_id | weight |
+----------+--------+
|        1 |      0 |
|        5 |      0 |
|        3 |    0.1 |
|        7 |    0.1 |
|        6 |    0.2 |
|        2 |    0.3 |
|        4 |    0.3 |
+----------+--------+

Let's create another table and fill it with values:

CREATE TABLE int2entry (
  an_int int(10) unsigned NOT NULL AUTO_INCREMENT,
  entry_id int(10) unsigned NOT NULL,
  PRIMARY KEY (an_int)
) ENGINE=Memory;
TRUNCATE int2entry;
INSERT INTO int2entry (entry_id)
VALUES (3), (7), (6), (6), (2), (2), (2), (4), (4), (4);

The idea is that the number of references to a specific entry_id is proportional to its weight. The table might be hard to update using SQL alone, and you have to truncate and refill it after each weight change, but, as I said, this can still be an option when updates are rare. Here is the query to get the entry_id, which you can then join to the entries table (you need to know the number of rows in the mapping table):

SELECT entry_id
FROM (SELECT ceiling(rand() * 10) as an_int) as t1
JOIN int2entry USING (an_int);

Another solution is to use the cumulative weights and take advantage of the order in the index.

When we select data, rows are read in some index order (for SELECT * that is the primary key order). The weights index is an ordered mapping from weights to entry_ids. If we select only weights and entry_ids, the values can be taken directly from the weights index, that is, the data will be read in index order. We can use ORDER BY to force iteration in reverse index order (the larger weights are stored at the end, but are matched more frequently). Why is this important? Because we are going to add some hacky magic to the WHERE clause and count on a specific order of processing the rows:

SET @rand:= RAND(), @cum_weight:=0.;
SELECT entry_id, weight, @cum_weight, @rand 
FROM entries
WHERE @rand < @cum_weight:=@cum_weight+weight
ORDER BY weight DESC
LIMIT 1;

+----------+--------+----------------------------------+--------------------+
| entry_id | weight | @cum_weight                      | @rand              |
+----------+--------+----------------------------------+--------------------+
|        6 |    0.2 | 0.800000026822090100000000000000 | 0.6957228003961247 |
+----------+--------+----------------------------------+--------------------+

In practice, you need only the entry id, so the resulting query should be something like:

SELECT *
FROM entries
JOIN (
  SELECT entry_id
  FROM entries
  JOIN (SELECT @rand:= RAND(), @cum_weight:=0.) as init
  WHERE @rand < @cum_weight:=@cum_weight+weight
  ORDER BY weight DESC
  LIMIT 1) as rand_entry USING (entry_id);

Mind the LIMIT 1, which stops processing as soon as the required entry is found.

Besides,

  • you should probably use the DECIMAL type instead of FLOAT or DOUBLE to store the weights, to avoid undesired rounding errors on a large set of small weights.
  • you can hide the hacky query inside a stored function and use cursors; the syntax might be easier to follow (see the sketch below).
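
For instance, a rough sketch of such a function (the name pick_weighted_entry is made up; RAND() is scaled by SUM(weight) so the weights do not have to sum to exactly 1). The cursor walks the weights in descending order and accumulates them until the random threshold is crossed, mirroring the query above:

DELIMITER //
CREATE FUNCTION pick_weighted_entry() RETURNS INT UNSIGNED
READS SQL DATA
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE cur_id INT UNSIGNED DEFAULT NULL;
  DECLARE cur_weight FLOAT;
  DECLARE cum_weight FLOAT DEFAULT 0;
  DECLARE threshold FLOAT;
  -- iterate from the largest weight down, as in the query above
  DECLARE cur CURSOR FOR
    SELECT entry_id, weight FROM entries ORDER BY weight DESC;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  -- scale the random value by the total weight, so the weights
  -- do not have to be normalised to sum to exactly 1
  SELECT SUM(weight) * RAND() INTO threshold FROM entries;

  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO cur_id, cur_weight;
    IF done THEN
      LEAVE read_loop;
    END IF;
    SET cum_weight = cum_weight + cur_weight;
    IF cum_weight > threshold THEN
      LEAVE read_loop;  -- cur_id now holds the selected entry
    END IF;
  END LOOP;
  CLOSE cur;
  RETURN cur_id;
END//
DELIMITER ;

It could then be used as SET @id := pick_weighted_entry(); SELECT * FROM entries WHERE entry_id = @id;
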
newtover
  • Can you actually start a lookup based on the backend BTREE? If that's the case, that's a nice way to gain a performance boost for little effort. Unfortunately, I doubt it's possible, and it's still linear anyway, albeit optimized for the average case. As for the integer mapping, I don't see the difference between that and the cumulative weights + BETWEEN query described earlier. Thanks for the input still, it's something to think about! – concat Dec 23 '14 at 02:39

One solution would be to have a table containing intervals representing the relative size of weights. Then you can select entries based on the start of an interval. The intervals would need to be contiguous.

So for example, we might have a table containing the following rows, representing weights of 25%, 25% and 50%

TABLE intervals:

id  int_start  int_size 
1   0          95
7   95         95
9   190        190

So for a query, we first generate our random number rand_n in the range 0 <= rand_n < 1.

Then we need to obtain the total interval:

SELECT int_start + int_size AS total_interval
FROM intervals
WHERE int_start =
    (SELECT MAX(int_start)
     FROM intervals)

And now we can retrieve the id of the row we're going to use

SELECT id from intervals 
WHERE int_start = 
    (SELECT MAX(int_start)
    FROM intervals
    WHERE int_start <= :rand_n * :total_interval)

I think these SELECTs will be O(log n), assuming an index on int_start.

In practice it should be possible to combine the queries and use a HAVING clause.
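
For example, one way to fold the two lookups into a single statement (this sketch uses a scalar subquery with ORDER BY ... LIMIT 1 rather than HAVING, and assumes an index on int_start):

SELECT id
FROM intervals
WHERE int_start <= :rand_n * (
        SELECT int_start + int_size
        FROM intervals
        ORDER BY int_start DESC
        LIMIT 1)
ORDER BY int_start DESC
LIMIT 1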

Adding a new row is straightforward so long as the relative weights of the original rows are unchanged - it can just be added onto the end, i.e. with int_start = :total_interval
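
For example (a sketch, with :total_interval obtained as above and :new_int_size reflecting the new row's weight relative to the existing rows):

INSERT INTO intervals (id, int_start, int_size)
VALUES (:new_id, :total_interval, :new_int_size)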

DELETEs require shifting down all entries with an int_start greater than that of the deleted row, which I believe is O(n log n):

UPDATE intervals
SET int_start = int_start - :int_size_for_deleted_row
WHERE int_start >= :int_start_for_deleted_row

And reweighting an entry will also be O(n log n) as a similar UPDATE to the one above will be required.
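
For example, a sketch of a reweight, with :delta = :new_int_size - :old_int_size for the reweighted row (id = :reweighted_id, starting at :int_start_for_reweighted_row):

UPDATE intervals
SET int_size = int_size + :delta
WHERE id = :reweighted_id

UPDATE intervals
SET int_start = int_start + :delta
WHERE int_start > :int_start_for_reweighted_row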

Sonic