
I am trying to understand how a huge volume of updates to a table affects data availability for users. I have been going through various posts (fastest-way-to-update-120-million-records, Avoid locking while updating) which walk through different mechanisms for doing large updates, such as populating a completely new table if the work can be done offline, and doing batch updates if it cannot.

I am trying to understand how these large updates affect table availability, and what the best way is to do large updates while keeping the table available for reads.

Use case: updating transaction details based on the primary key (e.g., updating stock holdings after a stock split).

Explorer

1 Answer


It is unclear what you need to do.

  • Replace the entire table -- populate new table, then swap
  • Change one column for all rows -- Sounds like sloppy design. Please elaborate on what you are doing.
  • Change one column for some rows -- ditto.
  • Adding a new column and initializing it -- Consider creating a parallel table, etc. This will have zero blockage but adds some complexity to your code.
  • The values are computed from other columns -- consider a "generated" column. (What version of MySQL are you using?)

Here is a discussion of how to walk through a table using the PRIMARY KEY and have minimal impact on other queries: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks (It is written with DELETE in mind, but the principle applies to UPDATE, too.)

Table availability

When any operation occurs, the rows involved are "locked" to prevent other queries from modifying them at the same time. (Locking involves multi-version concurrency control, etc.) The rows need to stay locked until the entire "transaction" is completed. Meanwhile, any changes need to be recorded in case the server crashes or the user decides to "roll back" the changes.

So, if millions of rows are being changed, then millions of locks are being held. That takes time.

My blog recommends doing only 1000 rows at a time; this is usually a small enough number to have very little interference with other tasks, yet large enough to get the task finished in a reasonable amount of time.
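The chunking pattern can be sketched in runnable form. The following is a minimal illustration using Python's stdlib `sqlite3` (the table name `t`, the column names, and the batch size of 1000 are assumptions for the demo; with MySQL the same pattern applies through a connector, and the transaction/locking details differ):

```python
import sqlite3

# Demo: walk a table by PRIMARY KEY in chunks of 1000 rows,
# committing after each chunk so locks are held only briefly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, 100.0) for i in range(1, 5001)])
conn.commit()

CHUNK = 1000
last_id = 0
while True:
    # Find the highest id in the next chunk of up to CHUNK rows.
    row = conn.execute(
        "SELECT MAX(id) FROM "
        "(SELECT id FROM t WHERE id > ? ORDER BY id LIMIT ?)",
        (last_id, CHUNK)).fetchone()
    if row[0] is None:
        break                      # no rows left
    hi = row[0]
    conn.execute(
        "UPDATE t SET price = 2 * price WHERE id > ? AND id <= ?",
        (last_id, hi))
    conn.commit()                  # short transaction per chunk
    last_id = hi

print(conn.execute(
    "SELECT COUNT(*) FROM t WHERE price = 200.0").fetchone()[0])
# → 5000
```

Each chunk is its own short transaction, so other queries are blocked for at most one chunk's worth of work rather than for the whole table.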

Stock Split

Assuming the desired query (against a huge table) is something like

UPDATE t
    SET price = 2 * price
    WHERE date < '...'
      AND ticker = '...' 

You need an index (or possibly the PRIMARY KEY) to be (ticker, date). Most writes are date-oriented, but most reads are ticker-oriented? Given this, the following may be optimal:

PRIMARY KEY(ticker, date),
INDEX(date, ticker)

With that, the rows that need modifying by the UPDATE are 'clustered' (consecutive) in the data's BTree, which gives some degree of efficiency. If, however, that is not "good enough", then it should be pretty easy to write code something like:

date_a = SELECT MIN(date) FROM t WHERE ticker = ?
SET AUTOCOMMIT=ON    -- each UPDATE commits on its own
Loop
    date_z = date_a + 1 month
    UPDATE t
        SET price = 2 * price
        WHERE date >= ?    -- put date_a here
          AND date <  ?    -- put date_z here
          AND ticker = '...'
    check for deadlock; if found, re-run the UPDATE
    set date_a = date_z
    exit loop when no rows remain at or after date_a
End Loop
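The loop above can be sketched as runnable code. Again this uses Python's stdlib `sqlite3` purely for illustration (the ticker `XYZ`, the schema, the `add_month` helper, and storing dates as ISO strings are all assumptions of the demo; a real MySQL version would also need the deadlock retry, which sqlite does not exhibit here):

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (ticker TEXT, date TEXT, price REAL, "
             "PRIMARY KEY (ticker, date))")
# Three months of daily rows for one ticker (days 1..27 of each month).
rows = [("XYZ", date(2020, m, d).isoformat(), 50.0)
        for m in (1, 2, 3) for d in range(1, 28)]
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
conn.commit()

def add_month(iso):
    # Hypothetical helper: step an ISO date string forward one month.
    y, m, d = map(int, iso.split("-"))
    y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return date(y, m, d).isoformat()

ticker = "XYZ"
date_a = conn.execute("SELECT MIN(date) FROM t WHERE ticker = ?",
                      (ticker,)).fetchone()[0]
while date_a is not None:
    date_z = add_month(date_a)
    conn.execute("UPDATE t SET price = 2 * price "
                 "WHERE ticker = ? AND date >= ? AND date < ?",
                 (ticker, date_a, date_z))
    conn.commit()                  # one short transaction per month
    # Advance; stop when no rows remain at or after date_z.
    more = conn.execute(
        "SELECT 1 FROM t WHERE ticker = ? AND date >= ? LIMIT 1",
        (ticker, date_z)).fetchone()
    date_a = date_z if more else None

print(conn.execute(
    "SELECT COUNT(*) FROM t WHERE price = 100.0").fetchone()[0])
# → 81
```

Because `(ticker, date)` is the primary key, each month's UPDATE touches a consecutive range of rows, matching the "clustered" access pattern described above.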

This will be reasonably fast, and have little impact on other queries. However, if someone looks at that ticker over a range of days while the loop is running, the prices may not be consistently updated. (If this concerns you, we can discuss further.)

Rick James
  • I modified the question based on your comment and thanks for providing the blog it is really helpful. – Explorer May 09 '20 at 05:50
  • When any operation occurs, the rows involved are "locked" -> Does this locked gets applied to the entire batch at the same time? or one record at a time? – Explorer May 09 '20 at 15:20
  • @Explorer - yes and yes. OK, let me see if I can unravel that contradiction. Logically, the entire transaction happens instantaneously and all the locks happen simultaneously. But, under the covers, it is one row at a time. The classic deadlock can happen with 2 rows: one thread locks the rows in one order; the other locks them in the opposite order. ((This is off-topic. Please do some reading, then start a new Question if you want to discuss further.)) – Rick James May 09 '20 at 17:23