I am using MySQL 8.0.30 to build a database and have created several InnoDB tables. These tables share the same design (columns and column types) but contain different numbers of records. Now I need to add a new column to every table and fill it in row by row, because each record gets a different value for that column.
I used peewee 3.15.1 to perform the update. Only two columns are involved: `id`, which is the primary key, and `col_a`, which is the column to be updated. The `col_a` column has no index, and the tables have no join relationships.
+-------+-----------------+------+-----+---------+----------------+
| Field | Type            | Null | Key | Default | Extra          |
+-------+-----------------+------+-----+---------+----------------+
| id    | int             | NO   | PRI | NULL    | auto_increment |
| col_a | varbinary(1024) | NO   |     | NULL    |                |
+-------+-----------------+------+-----+---------+----------------+
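For completeness, a minimal sketch of the peewee model I am using looks roughly like this (the database name, connection parameters, and table name are placeholders; `col_a` is declared as `BlobField` here for illustration even though the actual column type is `VARBINARY(1024)`):

```python
from peewee import AutoField, BlobField, Model, MySQLDatabase

# Placeholder connection settings.
MYSQLDB = MySQLDatabase("mydb", host="localhost", user="user", password="secret")

class MyTable(Model):
    id = AutoField()               # int, primary key, auto_increment
    col_a = BlobField(null=False)  # the new column that needs to be filled in

    class Meta:
        database = MYSQLDB
        table_name = "table_1"
```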
I performed the update in chunks. Let's say the variable `MYSQLDB` is the peewee database object, `table` is the peewee model class for the table being updated, and `num` is the total number of records in that table.
from math import ceil
from tqdm import tqdm

batch_size = 100_000
count = ceil(num / batch_size)

for i in tqdm(range(count)):
    # Fetch one chunk of rows by primary-key range.
    this_min = i * batch_size
    this_max = (i + 1) * batch_size
    query = (table
             .select(table.id, table.col_a)
             .where(table.id > this_min,
                    table.id <= this_max))

    # Assign each fetched row its new value.
    for row in query:
        row.col_a = <corresponding binary string for that row>

    # Write the whole chunk back in one transaction.
    with MYSQLDB.atomic():
        table.bulk_update(query, fields=[table.col_a])
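For reference, the per-chunk SELECT can be inspected with `query.sql()`, and as far as I understand from the peewee docs, `bulk_update()` then folds the 100K assignments into a single `UPDATE ... SET col_a = CASE ... END WHERE id IN (...)` statement:

```python
# Sketch: print the SQL peewee generates for one chunk
# (exact identifier quoting and aliasing may differ).
sql, params = query.sql()
print(sql)     # SELECT `t1`.`id`, `t1`.`col_a` FROM `table_1` AS `t1`
               # WHERE ((`t1`.`id` > %s) AND (`t1`.`id` <= %s))
print(params)  # e.g. [0, 100000] for the first chunk
```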
In each iteration, peewee fetches 100K records and then updates the `col_a` column. I found that the running time per iteration was highly correlated with the total number of records in the table.
| Data table | #Records      | Rough estimated running time per iteration (seconds) |
|------------|---------------|-------------------------------------------------------|
| table 1    | 60,329,700    | 227.75  |
| table 2    | 4,325,292     | 108.48  |
| table 3    | 3,433,891     | 101.94  |
| table 4    | 18,787,232    | 147.28  |
| table 5    | 25,605,277    | 168.04  |
| table 6    | 8,042,938     | 128.80  |
| table 7    | 10,652,355    | 135.54  |
| table 8    | 1,068,277     | 107.63  |
| table 9    | 1,189,886     | 117.33  |
| table 10   | 1,926,438     | 115.07  |
| table 11   | 7,590,218     | 118.50  |
| table 12   | 9,404,617     | 133.26  |
| table 13   | 1,033,066,290 | 5147.33 |
After excluding table 13, the Pearson correlation coefficient between the running time per iteration and the number of records in a table is 0.9815 (p-value = 1.649e-08).
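A correlation like this can be computed with, e.g., `scipy.stats.pearsonr`, using the numbers from the table above (table 13 excluded):

```python
from scipy.stats import pearsonr

records = [60_329_700, 4_325_292, 3_433_891, 18_787_232, 25_605_277, 8_042_938,
           10_652_355, 1_068_277, 1_189_886, 1_926_438, 7_590_218, 9_404_617]
seconds = [227.75, 108.48, 101.94, 147.28, 168.04, 128.80,
           135.54, 107.63, 117.33, 115.07, 118.50, 133.26]

r, p = pearsonr(records, seconds)
print(r, p)  # r ≈ 0.98, p on the order of 1e-08
```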
I am new to MySQL, and my questions are:

1. Why is the running time for updating the same number of records correlated with the total number of records in the table? Is it because the time needed to lock rows grows with the table size?
2. For table 13 the running time is not acceptable. In order to add the new column and fill in the corresponding values, should I instead dump all rows with `mysqldump` to a CSV file and then insert them back into a new table together with the new column?
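For the second question, what I have in mind is roughly the sketch below. It uses `SELECT ... INTO OUTFILE` / `LOAD DATA INFILE` instead of `mysqldump` just for illustration; the file paths, the new table name `table_13_new`, and the offline step that adds the `col_a` values are placeholders, and `secure_file_priv` must allow the paths:

```python
# Untested sketch of the dump-and-reload idea for table 13.
# Note: col_a is VARBINARY, so the values may need to be exported as HEX(...)
# and re-imported with UNHEX(...) to survive the round trip through a text file.

# 1) Export the existing rows to a CSV file on the server
#    (other columns omitted for brevity).
MYSQLDB.execute_sql(
    "SELECT id FROM table_13 "
    "INTO OUTFILE '/tmp/table_13.csv' "
    "FIELDS TERMINATED BY ','"
)

# 2) Offline: append the computed col_a value to every CSV row,
#    producing /tmp/table_13_filled.csv.

# 3) Bulk-load the result into a new table that already contains col_a.
MYSQLDB.execute_sql(
    "LOAD DATA INFILE '/tmp/table_13_filled.csv' "
    "INTO TABLE table_13_new "
    "FIELDS TERMINATED BY ',' "
    "(id, col_a)"
)
```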