I am pondering about performing a zero-downtime database migration and came up with minimum necessary steps.
By "migration" I mean any change in the same database that is not backward-compatible such as renaming, splitting or dropping a column.
Since I couldn't find much information elsewhere, I would like to validate my reasoning with someone having hands-on experience with this. Let's imagine we have a capability to perform rolling deployments, otherwise I don't believe that zero downtime DB migration is possible. So:
- Initial state: V1 is deployed in prod. It uses
table1.oldColumn
- Goal: rename
table1.oldColumn
totable1.newColumn
with zero downtime
Steps:
Create
table1.newColumn
:ALTER TABLE table1 ADD COLUMN newColumn(...)
Gradually deploy V2. The V2 code contains the following changes:
- SELECTs use oldColumn:
SELECT oldColumn FROM table1 WHERE userId = 1001
. That's because onlyoldColumn
contains full data for now whilenewColumn
contains only a subset of it - UPDATEs use both, but when a new value is missing in newColumn, it's copied from
oldColumn
. If we don't do that, we will chase constantly changingoldColumn
forever - INSERTs use both columns:
INSERT INTO table1 (oldColumn, newColumn) VALUES ('abcd', 'abcd')
- DELETEs are usually irrelevant because the delete remove the entire row:
DELETE FROM table1 WHERE userId = 1001
- However, if the column is a UNIQUE KEY, then the oldColumn is used:
DELETE FROM table1 WHERE oldColumn = 'xyz'
- However, if the column is a UNIQUE KEY, then the oldColumn is used:
- SELECTs use oldColumn:
Now that all new data is always in sync, we still have a diff between
oldColumn
andnewColumn
. In order to liquidate difference betweenoldColumn
andnewColumn
, we run a background script copying values missing innewColumn
fromoldColumn
Now that columns are in sync, gradually deploy V3. V3 code contains the following changes: SELECTs, UPDATEs, INSERTs and DELETEs go to
newColumn
now.table1.oldColumn
is not used anymoreDrop the unused
table1.oldColumn
:ALTER table1 DROP COLUMN oldColumn
Note: steps 3 and 5 can be performed as part of the database migration during V2 and V3 startup
Recap:
- Initially
newColumn
is empty and all data goes tooldColumn
- While we gradually replace V1 with V2, data starts to flow into
oldColumn
alongsidenewColumn
. At this point some data still flows intooldColumn
only (because we are performing a rolling update so not all instances are V2) - As soon as V2 is deployed, data flows in both
oldColumn
andnewColumn
. We mirror updates and inserts to keep columns in sync - However, some data was inserted into
oldColumn
beforenewColumn
was devised and some data got there from remaining V1 instances that existed during the rolling update. We must get rid of this difference - When the script is run, data in
oldColumn
missing innewColumn
gets copied there