
I have a large MySQL database (Aurora Serverless) and I would like to fetch a few random rows (say 1 or 5). I know that `ORDER BY RAND()` is very slow, so that's discarded.

I also know that some tricks use the row's identifier, but those only work when the id is an auto-incremented integer.

In my case, my database uses BINARY(16) as an identifier/primary key, and it is a randomly generated hash.

The thing is, what should I do to retrieve random rows for this configuration?

Note that in my case speed is more important than accuracy, so if it is not a perfectly random row, it is not a big issue.

Some ideas I have that I don’t know if they are good or bad:

- Every time I add a new row, also populate an extra column with RAND(), and use that field to sort. Problem is, this will return the same "random" rows again and again, unless I update that field regularly. Seems too complex.

- Send 2 requests. The first one gets the oldest createdAt date. The second one sorts using a random date between that oldest date and now. This is not 100% accurate because creation dates are not uniformly distributed, but as I said, speed is more important than accuracy in my use case.

- Somehow use my ids, since they are already random; perhaps I can sort starting from a random value. No concrete idea.

What do you think? Do you have more ideas? Thanks.

Ricardo
    Does this answer your question? [MySQL select 10 random rows from 600K rows fast](https://stackoverflow.com/questions/4329396/mysql-select-10-random-rows-from-600k-rows-fast) – Kraigolas Jun 03 '21 at 23:01
  • define "large" database? some people say large to mean 1000 rows, some to mean a billion rows; there's a big difference – ysth Jun 03 '21 at 23:03
  • @ysth With large I mean at least half million rows. – Ricardo Jun 03 '21 at 23:11
  • @kraigolas do you mean the second answer? Count plus offset? – Ricardo Jun 03 '21 at 23:12
  • I think using offset is a bad idea. Here more info: https://www.eversql.com/faster-pagination-in-mysql-why-order-by-with-limit-and-offset-is-slow/ – Ricardo Jun 03 '21 at 23:19
  • @Ricardo have you actually tried just using `select id from foo order by rand() limit 5`? or are you just theorizing that it will be too slow? I'm seeing 100ms times on a not super powerful server with a table with 500k rows. but a lot of things could make a difference in that – ysth Jun 03 '21 at 23:28
  • @ysth Actually, you are right. I was expecting something very slow, but with 500k rows, order by rand() limit 1 takes 0.1 seconds. I am very surprised. – Ricardo Jun 04 '21 at 19:54
  • @ysth Well, if I retrieve the full row (like 25 fields), it takes 10 seconds. But if I only retrieve the id, then it takes 0.1 seconds. This is very strange. – Ricardo Jun 04 '21 at 20:13
  • @Ricardo when ordering by rand, it really does have to do all the work as if it were returning the entire table before applying the limit, including reading the actual table data, not just the index. You can do `select * from foo where id in (select id from foo order by rand() limit 5)` and that should be efficient – ysth Jun 04 '21 at 20:19
  • @ysth Thanks a lot, I was thinking about doing that, but I receive: Database error code: 1235. Message: This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'. Any suggestion? Thanks! – Ricardo Jun 04 '21 at 21:04
  • ah, I forgot about that. you can just join a subquery instead though: `select yourtable.* from (select id from yourtable order by rand() limit 5) ids join yourtable using (id);` – ysth Jun 05 '21 at 00:37
  • What indexes do you currently have? – Rick James Jun 10 '21 at 20:34
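As an editorial aside, the derived-table trick from the comments can be sketched end to end. This is a minimal demo, assuming an in-memory SQLite table as a stand-in for the MySQL one (SQLite spells the function `RANDOM()` rather than `RAND()`; the table name `yourtable` and the `payload` column are illustrative):

```python
import os
import sqlite3

# Editorial sketch: an in-memory SQLite table stands in for the MySQL
# table, and RANDOM() stands in for MySQL's RAND().
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (id BLOB PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?)",
    [(os.urandom(16), f"row {i}") for i in range(1000)],
)

# Pick 5 random ids cheaply in a derived table, then join back to the
# table to fetch the full rows.
rows = conn.execute(
    """
    SELECT yourtable.*
    FROM (SELECT id FROM yourtable ORDER BY RANDOM() LIMIT 5) AS ids
    JOIN yourtable USING (id)
    """
).fetchall()
print(len(rows))  # 5
```

The point of the derived table is that the expensive `ORDER BY RANDOM()` pass only touches the id index, and only the five winning rows are fetched in full.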

2 Answers


If your ids are truly random, you can just pick a random value and find the first id greater than or equal to it. And if your random value happens to be greater than every id in the table, try again.

Ideally you pick the random value in your code, but `unhex(md5(rand()))` is a quick hack that should produce a random 16-byte string:

select id
from yourtable
where id >= unhex(md5(rand()))
order by id
limit 1
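Picking the random value in application code, as suggested, might look like the sketch below. It is an editorial illustration using an in-memory SQLite table with BLOB ids as a stand-in for the MySQL BINARY(16) column (blob comparison is bytewise in both, so the `>=` seek behaves the same); `yourtable` and `payload` are illustrative names:

```python
import os
import sqlite3

# Editorial sketch: SQLite BLOB ids stand in for MySQL BINARY(16).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (id BLOB PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?)",
    [(os.urandom(16), f"row {i}") for i in range(1000)],
)

def random_row(conn):
    # Seek to the first id >= a fresh random 16-byte value; if the value
    # lands past the largest id (no result), try again.
    while True:
        probe = os.urandom(16)
        row = conn.execute(
            "SELECT * FROM yourtable WHERE id >= ? ORDER BY id LIMIT 1",
            (probe,),
        ).fetchone()
        if row is not None:
            return row

row = random_row(conn)
print(row[1])
```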
ysth
  • One question: shouldn't I first retrieve the lowest and highest values and then use them to generate a random value in my code, so I can use it in the second request? Thanks! @ysth – Ricardo Jun 03 '21 at 23:21
  • you could, but if they are truly random strings, it shouldn't matter (except that you would know to never try a value higher than the max, but it's just as easy to find that case by noticing when you get no result) – ysth Jun 03 '21 at 23:24

If your ids are fairly uniformly distributed, you could generate a new random id and then run a query like:

SELECT * FROM mytable WHERE id > ? ORDER BY id LIMIT 1

If you need multiple random rows (you said between 1 and 5), then repeat the query, generating a new random id for each try.

Check that the query in fact returned one row, to account for the case where your random id is greater than the last id stored in the table. In that case, retry.

Check for duplicates, and retry if you get one. It should be rare to choose the same row out of half a million rows multiple times, so the overhead of the retries is minimal.

There's also a risk if you want N random rows, but the total number of rows in the table is less than N. If your retry-on-duplicate logic doesn't account for this, you could create an infinite loop.

If the ids are not uniformly distributed, this technique has a higher chance of choosing a row that is preceded by a largish gap. So it's not a very accurate randomizer.
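The retry logic described above can be sketched like this; it is an editorial illustration using an in-memory SQLite table as a stand-in for the MySQL one (`mytable` and `payload` are illustrative names):

```python
import os
import sqlite3

# Editorial sketch: SQLite BLOB ids stand in for MySQL BINARY(16).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id BLOB PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(os.urandom(16), f"row {i}") for i in range(500)],
)

def random_rows(conn, n):
    # Never ask for more distinct rows than the table holds, otherwise
    # the retry-on-duplicate loop below would never terminate.
    total = conn.execute("SELECT COUNT(*) FROM mytable").fetchone()[0]
    n = min(n, total)
    picked = {}
    while len(picked) < n:
        probe = os.urandom(16)
        row = conn.execute(
            "SELECT * FROM mytable WHERE id > ? ORDER BY id LIMIT 1",
            (probe,),
        ).fetchone()
        # Retry both on a miss (probe past the last id) and on a duplicate.
        if row is not None and row[0] not in picked:
            picked[row[0]] = row
    return list(picked.values())

rows = random_rows(conn, 5)
print(len(rows))  # 5
```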

Bill Karwin