
For those suggesting that this is not a programming question: it ultimately is, since I have to 1) write the code, and 2) weigh the amount of algorithmic work involved in comparing data for each row in option #1. I am not asking how to write queries or how my database should be set up.

I have run into a bit of a pickle. I have a database which follows the Google GTFS spec, and I am now writing an automatic update program to service this database.

The database gets an overhaul once every 3 months. The table with the fewest rows contains between 1 and 10 rows, the largest table contains 7 million rows, and the rest have between 10,000 and 80,000 rows.

The files my program downloads are .txt files, each of which translates to one table. In other words: stops.txt = database.stops. The tables are InnoDB.
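
To give an idea of the loading step, a minimal sketch of bulk-loading one of those files into a table; the staging table name, file path and column list (taken from the stops.txt spec) are illustrative and would need adjusting to the real schema:

```sql
-- Hypothetical bulk load of one GTFS file into a staging table.
-- Column names follow stops.txt; adjust to the actual schema.
LOAD DATA LOCAL INFILE '/tmp/gtfs/stops.txt'
INTO TABLE stops_new
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(stop_id, stop_name, stop_lat, stop_lon);
```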

I have come up with 2 solutions.

  • 1) Compare every row ID in the downloaded .txt files to what is currently in the database; if nothing has changed, do nothing; if something has changed, do an UPDATE.

  • 2) INSERT the downloaded files into their own tables (basically mirroring the live tables) and then switch table names. Example: database.stops_new and database.stops switch names. (A rough sketch of the swap follows below.)
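
Assuming staging tables named *_new have already been loaded, the swap in option 2 could look roughly like this; the table names are illustrative, and MySQL's RENAME TABLE performs every rename listed in one statement atomically:

```sql
-- Swap the freshly loaded staging tables in for the live ones in one atomic step.
-- Every GTFS table goes in the same statement; names here are illustrative.
RENAME TABLE
    stops      TO stops_old,      stops_new      TO stops,
    routes     TO routes_old,     routes_new     TO routes,
    stop_times TO stop_times_old, stop_times_new TO stop_times;

-- Once the new data has been verified, the old copies can be dropped.
DROP TABLE stops_old, routes_old, stop_times_old;
```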

Another twist: the overhaul takes effect on a date defined in one of the .txt files, which means it could fall on a Saturday or perhaps a peak holiday, when users could be sending queries at any time.

Question: which approach should I go for to ensure that nothing breaks and that the user experience is upheld? (I am leaning towards option 2...)


2 Answers


A lot depends on the volume of updates. If only 10 records change out of 7 million, doing the update might be reasonable.

However, I want to advocate for the second approach. Or, perhaps, for a variation on the second approach.

First, determining which records are the same and which have changed is a lot of work. It means downloading all the data into a staging table, running complicated queries to find the changes, and then applying the updates to the live table. That is a lot of processing overhead.
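
To make that concrete, the per-table comparison might look something like the following, assuming the file has already been loaded into a staging table stops_new (table and column names are illustrative, and a DELETE for rows missing from the new file would also be needed):

```sql
-- Update existing rows whose attributes differ from the incoming data.
-- Note: <> ignores NULLs; use NOT (a <=> b) if the columns are nullable.
UPDATE stops s
JOIN stops_new n ON n.stop_id = s.stop_id
SET s.stop_name = n.stop_name,
    s.stop_lat  = n.stop_lat,
    s.stop_lon  = n.stop_lon
WHERE s.stop_name <> n.stop_name
   OR s.stop_lat  <> n.stop_lat
   OR s.stop_lon  <> n.stop_lon;

-- Insert rows that exist in the file but not yet in the database.
INSERT INTO stops (stop_id, stop_name, stop_lat, stop_lon)
SELECT n.stop_id, n.stop_name, n.stop_lat, n.stop_lon
FROM stops_new n
LEFT JOIN stops s ON s.stop_id = n.stop_id
WHERE s.stop_id IS NULL;
```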

The second approach downloads the data, gets the tables into the right format, and then switches over. For the switch, I would lock the tables to prevent other connections from using them, rename the tables, and then unlock everything. This should take only a few seconds.
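
A sketch of that switch for a single table, with the caveat that renaming a table held under LOCK TABLES is only permitted from MySQL 8.0.13 onward; on earlier versions the RENAME TABLE statement, which is atomic on its own, is normally issued without the explicit lock:

```sql
-- Block other sessions while the swap happens (requires MySQL 8.0.13+
-- when combined with RENAME TABLE; otherwise rely on RENAME's atomicity).
LOCK TABLES stops WRITE, stops_new WRITE;
RENAME TABLE stops TO stops_old, stops_new TO stops;
UNLOCK TABLES;
```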

If you can handle minutes of downtime, you might want to truncate the existing tables and then insert the new data into them. This has the advantage of preserving security access, indexes, triggers, and other information about the tables.
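
A sketch of that variation for one table, again assuming a staging copy stops_new; the window between the TRUNCATE and the end of the INSERT is the downtime being accepted:

```sql
-- Keep the live table's definition, indexes, triggers and privileges;
-- just empty it and reload it from the staging copy.
TRUNCATE TABLE stops;
INSERT INTO stops (stop_id, stop_name, stop_lat, stop_lon)
SELECT stop_id, stop_name, stop_lat, stop_lon
FROM stops_new;
```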

Gordon Linoff

According to the MySQL documentation...

> MySQL uses table-level locking for MyISAM, MEMORY and MERGE tables, page-level locking for BDB tables, and row-level locking for InnoDB tables.

Since your tables are InnoDB, you will be able to lock at the row level instead of rendering the entire table unusable...

> Table updates are given higher priority than table retrievals.

Updates take priority over selects, and row-level locking is based on key values, so I think this would be a much better option.

> Advantages of row-level locking:
>
>   • Fewer lock conflicts when different sessions access different rows
>   • Fewer changes for rollbacks
>   • Possible to lock a single row for a long time
>
> Disadvantages of row-level locking:
>
>   • Requires more memory than page-level or table-level locks
>   • Slower than page-level or table-level locks when used on a large part of the table because you must acquire many more locks
>   • Slower than other locks if you often do GROUP BY operations on a large part of the data or if you must scan the entire table frequently

However, in general, table locks are superior to row-level locks in some cases, according to the MySQL documentation...

Another option...

> Instead of using row-level locks, you can employ application-level locks, such as those provided by GET_LOCK() and RELEASE_LOCK() in MySQL. These are advisory locks, so they work only with applications that cooperate with each other. See Section 12.14, “Miscellaneous Functions”.
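
For example, the updater could take an advisory lock around the whole refresh so that only one instance of it runs at a time; the lock name here is arbitrary and only meaningful to clients that also check it:

```sql
-- Returns 1 if the lock was obtained, 0 after the 10-second timeout.
SELECT GET_LOCK('gtfs_refresh', 10);

-- ... load the staging tables and swap or update them here ...

SELECT RELEASE_LOCK('gtfs_refresh');
```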

Hituptony