
I have a fairly large MySQL table (~600GB) on my own computer (Win10) with the following structure:

id  var1  var2  var3
a   val1  1     5
b   val1  2     6
c   val2  3     7
d   val2  4     8

Both id and var1 are indexed. I want to split this table into several sub-tables based on the values of var1. That is,

For table table_var1:

id var1  var2 var3
a  val1  1    5
b  val1  2    6

For table table_var2:

id var1  var2 var3
c  val2  3    7
d  val2  4    8

I used the following code:

CREATE TABLE table_var1 LIKE original_table;
INSERT INTO table_var1 SELECT * FROM original_table WHERE var1 = 'val1';

CREATE TABLE table_var2 LIKE original_table;
INSERT INTO table_var2 SELECT * FROM original_table WHERE var1 = 'val2';

My question is very similar to this one. I want to speed up the splitting of the table, but since the database is on my own computer, partitioning is not really helpful if I'm not mistaken (as I understand it, partitioning helps more when several physical hard disks are available?).

Are there any suggestions for improving the performance of splitting the table?

Ding Li
  • Why do you want to split the table? I can't think of a good reason to do that – ysth Nov 29 '21 at 03:28
  • What is the actual problem you are having? – ysth Nov 29 '21 at 03:29
  • @ysth I need to speed up the splitting. I need to do further processing on some columns (e.g., create more columns in the table based on var2 or var3); however, the actions will be different for different var1 values, so I want to split this table into sub-tables and do these actions separately on the sub-tables. – Ding Li Nov 29 '21 at 03:38
  • No, what is the problem you hope to address by doing the splitting? Sub-tables in general are a terrible idea and make many things more cumbersome – ysth Nov 29 '21 at 04:19
  • What version of MySQL? – Rick James Nov 29 '21 at 19:35

1 Answer


Yes, your two steps are probably the fastest way. Even faster is a parallel approach. (More in a minute.)

It is probably best to have the PRIMARY KEY defined in the CREATE TABLE, but delay adding secondary keys until after populating the new tables.
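For example, a sketch of that approach (the column types here are guesses, since the full definition of original_table isn't shown; adjust them to match, and add whichever secondary indexes you actually need -- the index on var2 below is only a placeholder):

-- Create the new table with only the PRIMARY KEY defined:
CREATE TABLE table_var1 (
    id   VARCHAR(32) NOT NULL,   -- assumed type; match original_table
    var1 VARCHAR(32) NOT NULL,   -- assumed type
    var2 INT,                    -- assumed type
    var3 INT,                    -- assumed type
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Bulk load while there are no secondary indexes to maintain:
INSERT INTO table_var1
    SELECT * FROM original_table WHERE var1 = 'val1';

-- Add secondary indexes only after the load (placeholder example):
ALTER TABLE table_var1 ADD INDEX idx_var2 (var2);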

Be sure you have plenty of disk space for each step -- maybe 700GB for the split, then a smaller amount for adding the secondary indexes.

Have innodb_buffer_pool_size set to about 70% of RAM.
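For example, assuming a machine with 32GB of RAM (so roughly 22GB for the buffer pool); on MySQL 5.7.5+ the variable can be resized at runtime, otherwise set it in my.ini and restart:

-- Check the current value (in bytes):
SELECT @@innodb_buffer_pool_size;

-- Resize at runtime (MySQL 5.7.5+); the server rounds to a multiple of the chunk size:
SET GLOBAL innodb_buffer_pool_size = 22 * 1024 * 1024 * 1024;

-- Or in my.ini under [mysqld]:
--   innodb_buffer_pool_size = 22G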

If original_table and the new tables all have PRIMARY KEY(id) (or at least a PK starting with id), the INSERT..SELECT should be a table scan and very efficient for I/O, CPU, and the buffer_pool.

How many new tables will there be?

Assuming no more than 20% of the table has a specific value for var1, the index on var1 will be ignored; don't worry about it. A "table scan" will be more efficient than using the index.
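You can confirm the plan with EXPLAIN (using 'val1' from the sample data above):

EXPLAIN SELECT * FROM original_table WHERE var1 = 'val1';
-- type = ALL means a full table scan (expected, and fine here);
-- type = ref would mean the var1 index is being used.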

It may be possible to speed up the process further by running all the INSERT..SELECTs simultaneously (from different connections). Assuming the buffer_pool is less than the 600GB of table size, sequential scans would involve multiple full loads of that table -- lots of I/O. Parallel scans would (probably) lead to fetching original_table only once.

(Use the commandline tool "mysql" so you can run in parallel. Workbench does not seem practical for this.)
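A sketch of the parallel version -- each statement runs in its own mysql session, started at roughly the same time:

-- Session 1:
INSERT INTO table_var1 SELECT * FROM original_table WHERE var1 = 'val1';

-- Session 2:
INSERT INTO table_var2 SELECT * FROM original_table WHERE var1 = 'val2';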

That is, "count the disk hits" comes into play in your particular task. The sequential approach would take about (N+1)x600GB of data being shoveled. The parallel approach would involve only about 2x600GB of I/O. That is 600GB read + N x 600GB / N written to each new table.

If you have secondary indexes, that complicates the sequential vs parallel tradeoffs. Please advise on indexes before I stress my brain to come up with a prediction.

Rick James