
I have a very large table with about 6.6 million records, and I want to select a random sample of 100,000 records.

SELECT column FROM table
ORDER BY RAND()
LIMIT 100000

is EXTREMELY SLOW, because it has to evaluate RAND() for every record and sort all 6.6 million rows.

I have not found a solution that works with MySQL/MariaDB to extract a random sample of 100,000 records.

Please advise.

Thank you.

Viktor
  • How often/when would you have the need to run this query? You could keep a materialized view around, and update it every so often, before you select from it. – Tim Biegeleisen Jan 12 '22 at 02:40
  • I don't understand what you mean by "materialized view". We need to run this query once per day. Are you suggesting we first split the table into smaller tables randomly? If so, what would be the MySQL command to do that? – Viktor Jan 12 '22 at 02:43
  • No...I'm saying you pre-run the query in a materialized view. Then, when you need to use it, just select against that view, which should be very fast. – Tim Biegeleisen Jan 12 '22 at 02:45
  • `ORDER BY RAND()` will always be proportional to the number of rows, just to determine which row is ordered first, and it *requires lots of memory/temp space for sorting*. Here is an answer I wrote for *SQL Server* that shows an alternative approach which, assuming some acceptable bias and a few optimizations, can scale well. YMMV on adapting it to this particular use case (in mysql). — https://stackoverflow.com/questions/848872/select-n-random-rows-from-sql-server-table/66179188#66179188 – user2864740 Jan 12 '22 at 03:12
  • @TimBiegeleisen What is "materialized view" in MySQL? – Akina Jan 12 '22 at 05:40
  • Is the goal only to select 100k random rows, or is their random ordering also a goal? – Akina Jan 12 '22 at 05:43
  • Why would you optimize a query that runs once a day? – Salman A Jan 12 '22 at 08:23
  • Multiple solutions that do not involve full scans: http://mysql.rjweb.org/doc.php/random – Rick James Jan 12 '22 at 20:08
  • @Viktor Rick James' link above should guide you to AVOID your current use of RAND() and let your query complete quickly. – Wilson Hauck Jan 13 '22 at 21:11
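
A minimal sketch of the "materialized view" suggestion from the comments above: MySQL/MariaDB has no native materialized views, so the idea is to pre-build the sample into an ordinary table once per day (here via the event scheduler) and let the daily job read from that table. `table`/`column` are the placeholders from the question; `daily_sample` and `refresh_daily_sample` are made-up names; the event scheduler must be enabled (`SET GLOBAL event_scheduler = ON;`), and in the mysql client the BEGIN ... END body needs a temporary DELIMITER change.

-- one-time setup: a plain table that plays the role of the materialized view
CREATE TABLE daily_sample (`column` INT);   -- match the real column type

-- rebuild the sample once per day, outside the latency-critical path
CREATE EVENT refresh_daily_sample
ON SCHEDULE EVERY 1 DAY
DO
BEGIN
    TRUNCATE daily_sample;
    INSERT INTO daily_sample
    SELECT `column` FROM `table`
    ORDER BY RAND()   -- still a full sort, but paid only once per day
    LIMIT 100000;
END;

-- the daily consumer then just reads the pre-built table, which is fast
SELECT `column` FROM daily_sample;

The expensive ORDER BY RAND() still runs once per day inside the event; the cheaper WHERE RAND() < ... filter from the answer below could be dropped in instead if even that is too slow.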

1 Answer


You may try to decrease the number of rows that have to be sorted.

Example:

  1. Create source data
mysql> create table test (id int auto_increment primary key, val int);
Query OK, 0 rows affected (0.05 sec)

mysql> set @@cte_max_recursion_depth := 10000000;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into test (val) with recursive cte as ( select 1 num union all select num+1 from cte where num < 6600000 )select rand() * 1000000000 from cte;
Query OK, 6600000 rows affected (1 min 48.62 sec)
Records: 6600000  Duplicates: 0  Warnings: 0

mysql> create table tmp (val int);
Query OK, 0 rows affected (0.05 sec)
  2. Insert without sorting
mysql> insert into tmp select val from test limit 100000;
Query OK, 100000 rows affected (1.93 sec)
Records: 100000  Duplicates: 0  Warnings: 0
  3. Insert with random sorting
mysql> insert into tmp select val from test order by rand() limit 100000;
Query OK, 100000 rows affected (26.31 sec)
Records: 100000  Duplicates: 0  Warnings: 0
  4. Insert with random selection (1.1 is overage coefficient)
mysql> insert into tmp select val from test where rand() < 1.1 * 100000 / 6600000 limit 100000;
Query OK, 100000 rows affected (15.89 sec)
Records: 100000  Duplicates: 0  Warnings: 0
  5. Insert with random selection (1.1 is overage coefficient) and random sorting
mysql> insert into tmp select val from test where rand() < 1.1 * 100000 / 6600000 order by rand() limit 100000;
Query OK, 100000 rows affected (19.26 sec)
Records: 100000  Duplicates: 0  Warnings: 0

The overage coefficient may be adjusted. If you decrease it, the query gets (slightly) faster, but the probability that the query returns fewer than the needed 100k rows increases.
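
As a rough back-of-envelope check (assuming RAND() is uniform and each row passes the filter independently, so the number of matching rows is approximately binomial):

  coefficient 1.1:  expected matches = 6,600,000 * (1.1 * 100,000 / 6,600,000) = 110,000, standard deviation ≈ 329,
                    so falling below 100,000 is about 30 standard deviations away, i.e. practically impossible;
  coefficient 1.01: expected matches = 101,000, standard deviation ≈ 315,
                    so the chance of getting fewer than 100,000 rows grows to roughly 0.1%.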

Akina
  • What is the memory required for that recursive CTE? – Rick James Jan 12 '22 at 20:09
  • @RickJames Process mysqld.exe, System Commit (reported by ProcessExplorer) before the query = 522.9 Mb, during the execution = 780.2-782.7 Mb, after = 526.2 Mb. – Akina Jan 12 '22 at 20:21
  • Thanks. I wonder if that implies 30 bytes per level of depth. I wish MySQL would pick up MariaDB's `seq_1_to_6600000` table. – Rick James Jan 12 '22 at 20:48
  • The "Rows_accessed" for tests 3,4,5 is on the order of 6,600,000; that is the reason for taking much longer than test 2 (100,000 rows). Virtually all simple ways to fetch random rows involve a full table scan. Arrange for your queries to show up in the slowlog. Or wrap in `FLUSH STATUS;` and `SHOW VARIABLES LIKE 'Handler%';` – Rick James Jan 12 '22 at 20:52