
I am using MySQL 8.0.30 to build a database, and I have created several InnoDB tables. These tables share the same design (columns and column types) but contain different numbers of records. Now I need to add a new column to every table and update that column record by record, since every record has a different value for that column.

I used peewee 3.15.1 to perform the update. Only two columns are involved: id, which is the primary key, and col_a, which is the column to be updated. The col_a column has no index, and the tables have no join relationships.

+--------------+--------------------+------+-----+---------+----------------+
| Field        | Type               | Null | Key | Default | Extra          |
+--------------+--------------------+------+-----+---------+----------------+
| id           | int                | NO   | PRI | NULL    | auto_increment |
| col_a        | varbinary(1024)    | NO   |     | NULL    |                |
+--------------+--------------------+------+-----+---------+----------------+
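
For context, here is a minimal sketch of the kind of peewee model such a table maps to; the model name, connection parameters, and the use of BlobField (peewee has no built-in VARBINARY field type) are illustrative, not my actual code:

from peewee import AutoField, BlobField, Model, MySQLDatabase

# Connection parameters are placeholders.
MYSQLDB = MySQLDatabase('mydb', user='...', password='...', host='localhost')

class MyTable(Model):
    # int, primary key, auto_increment
    id = AutoField()
    # Stand-in for the varbinary(1024) column shown above; peewee's closest
    # built-in field type is BlobField.
    col_a = BlobField(null=False)

    class Meta:
        database = MYSQLDB
        table_name = 'table_1'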

I performed the update in chunks. Let's say the variable MYSQLDB is the peewee database object, table is the peewee model for the table being updated, and num is the total number of records in that table.

from math import ceil
from tqdm import tqdm

batch_size = 100_000
count = ceil(num / batch_size)

for i in tqdm(range(count)):
    this_min = i * batch_size
    this_max = (i + 1) * batch_size

    # Fetch one chunk of rows by primary-key range.
    query = (table.select(table.id,
                          table.col_a)
                  .where(table.id > this_min,
                         table.id <= this_max))

    rows = list(query)
    for row in rows:
        row.col_a = <corresponding binary string for that row>

    # Write the modified rows back in a single transaction.
    with MYSQLDB.atomic():
        table.bulk_update(rows, fields=[table.col_a])

In each iteration, peewee fetches 100K records and then updates column col_a. I found that the running time per iteration was highly correlated with the total number of records in the table.

| Data table | #Records | Rough estimated running time per iteration (seconds) |
|------------|---------------|---------|
| table 1    | 60,329,700    | 227.75  |
| table 2    | 4,325,292     | 108.48  |
| table 3    | 3,433,891     | 101.94  |
| table 4    | 18,787,232    | 147.28  |
| table 5    | 25,605,277    | 168.04  |
| table 6    | 8,042,938     | 128.80  |
| table 7    | 10,652,355    | 135.54  |
| table 8    | 1,068,277     | 107.63  |
| table 9    | 1,189,886     | 117.33  |
| table 10   | 1,926,438     | 115.07  |
| table 11   | 7,590,218     | 118.50  |
| table 12   | 9,404,617     | 133.26  |
| table 13   | 1,033,066,290 | 5147.33 |

After excluding table 13, the Pearson correlation coefficient between running time per iteration and the number of records in the table is 0.9815 (p-value = 1.649e-08).
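
The coefficient can be reproduced from the table above with something like the following sketch (assuming SciPy; the variable names are illustrative, and table 13 is excluded):

from scipy.stats import pearsonr

# Record counts and per-iteration times for tables 1-12 (table 13 excluded).
records = [60_329_700, 4_325_292, 3_433_891, 18_787_232, 25_605_277,
           8_042_938, 10_652_355, 1_068_277, 1_189_886, 1_926_438,
           7_590_218, 9_404_617]
seconds = [227.75, 108.48, 101.94, 147.28, 168.04, 128.80,
           135.54, 107.63, 117.33, 115.07, 118.50, 133.26]

r, p = pearsonr(records, seconds)
print(r, p)  # close to the reported 0.9815 and 1.649e-08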

[scatter plot of running time per iteration vs. number of records, with regression line]

I am new to MySQL, and my questions are:

  1. Why is the running time for updating the same number of records correlated with the total number of records in the table? Is it because the time needed to lock rows increases with the table size?

  2. For table 13, the running time is not acceptable. In order to add the new column and fill in the corresponding values, should I instead dump all rows to a CSV file with mysqldump, then insert them back into a new table together with the new column?

Ningshan Li
  • do you know if peewee bulk_update is actually updating one row at a time? if not, you could gain a lot by updating multiple rows at once – ysth Jan 03 '23 at 04:28
  • do you in fact need to do all the updates in a single transaction? – ysth Jan 03 '23 at 04:29
  • I don't need to update all rows in one transaction. The documentation of the peewee [`bulk_update`](http://docs.peewee-orm.com/en/latest/peewee/api.html#Model.bulk_update) function doesn't explain clearly how the update is performed. I saw `bulk_update` given as the answer to two related questions ([Q1](https://stackoverflow.com/questions/52096473/bulk-update-using-peewee-library), [Q2](https://stackoverflow.com/questions/45427755/python-peewee-mysql-bulk-update)). I am not sure how to update multiple rows at once, given that the value to be written differs from row to row. – Ningshan Li Jan 03 '23 at 07:17
  • you update multiple rows by joining your table to a VALUES table constructor: https://dbfiddle.uk/247CuCb7 (see the sketch after these comments). The entire statement has to be less than @@max_allowed_packet, but even with 1024-byte binary values you should be able to do 100 at a time – ysth Jan 03 '23 at 15:13
  • anyway, I can't answer your basic question about why the correlation with table size, but I can point out you have a number of factors to try removing and see what makes a difference: your orm (try raw multirow updates), locking (just do autocommit for each update statement), whatever tqdm might be doing – ysth Jan 03 '23 at 15:18
  • bulk_update() is rather complex under the hood and probably not well-suited to this task. I suggest reading the code for the method to get an understanding of how it works. – coleifer Jan 03 '23 at 17:42
  • Thanks for the suggestions @ysth @coleifer. I finally decided to read all columns from the original data table, add the corresponding values for the new column, and save the rows into a new data table. Using the peewee `insert_many` function gives acceptable running time even for the large table with 1 billion records. – Ningshan Li Jan 04 '23 at 08:57
  • I was going to suggest you could also use INSERT ON DUPLICATE KEY UPDATE (insert_many + on_conflict), docs here: http://docs.peewee-orm.com/en/latest/peewee/querying.html#upsert – coleifer Jan 04 '23 at 13:49
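
For reference, here is a minimal sketch of the multi-row UPDATE ysth suggests above: join the table to a VALUES row constructor (MySQL 8.0.19+) and issue it through peewee as raw SQL. The helper name and batching are illustrative, and each statement must stay under @@max_allowed_packet:

# Illustrative helper (not from the question): update many rows in one
# statement by joining the table to a VALUES row constructor (MySQL 8.0.19+).
# `rows` is a list of (id, new_col_a) tuples.
def update_batch(db, table_name, rows):
    values = ", ".join(["ROW(%s, %s)"] * len(rows))
    sql = (
        f"UPDATE {table_name} AS t "
        f"JOIN (VALUES {values}) AS v ON t.id = v.column_0 "
        f"SET t.col_a = v.column_1"
    )
    params = [p for row in rows for p in row]
    # peewee's Database.execute_sql() runs raw parameterized SQL.
    db.execute_sql(sql, params)

# e.g. update_batch(MYSQLDB, 'table_1', [(1, b'\x00'), (2, b'\x01')])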
