The textbook solution would be to run two queries, one for rows with true
and one for rows with `false:
SELECT * FROM mytable WHERE `condition`=true ORDER BY RAND() LIMIT 5000;
SELECT * FROM mytable WHERE `condition`=false ORDER BY RAND() LIMIT 5000;
The WHERE clause applies first, to reduce the matching rows, then it sorts the subset of rows randomly and picks up to 5000 of them. The result is a random subset.
This solution has an advantage that it returns a pretty evenly distributed set of random rows, and it automatically handles cases like there being an unknown proportion of true to false in the table, and even handles if one of the condition values matches fewer than 5000 rows.
The disadvantage is that it's incredibly expensive to sort such a large set of rows, and an index does not help you sort by a nondeterministic expression like RAND().
You could do this with window functions if you need it to be a single SQL query, but it would still be very expensive.
SELECT t.*
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY `condition` ORDER BY RAND()) AS rownum
FROM mytable
) AS t
WHERE t.rownum <= 5000;
Another alternative that does not use a random sort operation would be to do a table-scan, and pick a random subset of rows. But you need to know roughly how many rows match each condition value, so that you can estimate the fraction of these that would make ~5000 rows. Say for example there are 1 million rows with the true value and 500k rows with the false value:
SELECT * FROM mytable WHERE `condition`=true AND RAND()*1000000 < 5000;
SELECT * FROM mytable WHERE `condition`=false AND RAND()*500000 < 5000;
This is not guaranteed to return exactly 5000 rows, because of the randomness. But probably pretty close. And a table-scan is still quite expensive.
The answer from O.Jones gives me another idea. If you can add a column, then you can add an index on that column.
ALTER TABLE `table`
ADD COLUMN rando FLOAT DEFAULT NULL,
ADD INDEX (`condition`, rando);
UPDATE `table` SET rando = RAND() WHERE rando IS NULL;
Then you can use indexed searches. Again, you need to know how many rows match each value to do this.
SELECT * FROM mytable
WHERE `condition`=true AND rando < 5000/1000000
ORDER BY `condition`, rando
LIMIT 5000;
SELECT * FROM mytable
WHERE `condition`=true AND rando < 5000/500000
ORDER BY `condition`, rando
LIMIT 5000;
The ORDER BY in this case should be a no-op if the index I added is used. The rows will be read in index order anyway, and MySQL's optimizer will not do any work to sort them.
This solution will be much faster, because it doesn't have to sort anything, and doesn't have to do a table-scan. MySQL has an optimization to bail out of a query once the LIMIT has been satisfied.
But the disadvantage is that it doesn't return a different random result when you run the SELECT again, or if different clients run the query. You would have to use UPDATE to re-randomize the whole table to get a different result. This might not be suitable depending on your needs.