
I know there are a lot of related questions with many answers, but I have a slightly more nuanced question. I have been reading about different techniques for inserting records in bulk, but are there limits on how big an insert query can be? Can the same technique be used for REPLACE and INSERT ... ON DUPLICATE KEY UPDATE ...? Is there a faster method?

Table:

+-----------+-------------+------+-----+---------+----------------+
| Field     | Type        | Null | Key | Default | Extra          |
+-----------+-------------+------+-----+---------+----------------+
| a         | int(11)     | NO   | PRI | NULL    | auto_increment |
| b         | int(11)     | YES  |     | NULL    |                |
| c         | int(11)     | YES  |     | NULL    |                |
+-----------+-------------+------+-----+---------+----------------+

#1

1) "INSERT INTO TABLE COLUMNS (a,b,c) values (1,2,3);"
2) "INSERT INTO TABLE COLUMNS (a,b,c) values (5,6,7);"
3) "INSERT INTO TABLE COLUMNS (a,b,c) values (8,9,10);"
 ...
10,000) "INSERT INTO TABLE COLUMNS (a,b,c) values (30001,30002,30003);"

or

#2 - should be faster, but is there a limit?

"INSERT INTO TABLE COLUMNS (a,b,c) values (1,2,3),(4,5,6),(8,9,10)....(30001,30002,30003)" ;

From a scripting perspective (PHP), using #2, is it better to loop through and queue up 100 rows at a time (1,000 statements), 1,000 rows at a time (100 statements), or all 10,000 in a single statement? Could this be done with 100,000 rows?

Can the same be used with REPLACE:

"REPLACE INTO TABLE (a, b, c) VALUES(1,2,3),(4,5,6)(7,8,9),...(30001,30002,30003);"

Can it also be used with INSERT ... ON DUPLICATE KEY UPDATE?

INSERT INTO tbl (a, b, c) VALUES (1,2,3),(4,5,6),(7,8,9), ... ,(30001,30002,30003) ON DUPLICATE KEY UPDATE (b=2,c=3),(b=5,c=6),(b=8,c=9), ... ,(b=30002,c=30003) ?

For any and all of the above (assuming the replace/on duplicate are valid), are there faster methods to achieve the inserts?

rolinger
  • Which case do you really have? A plain insert of all-new rows? Or might you need to replace/update existing rows? – Rick James Sep 14 '22 at 04:38

2 Answers


The length of any SQL statement is limited by a MySQL option called max_allowed_packet.

The syntax of INSERT allows you to add an unlimited number of tuples after the VALUES clause, but the total length of the statement, from the INSERT keyword through the last tuple, must still be no more than max_allowed_packet bytes.
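For reference, you can check the current limit and, with sufficient privileges, raise it. A minimal sketch (the 64 MB figure is only illustrative, not a recommendation):

-- Check the current limit, in bytes
SHOW VARIABLES LIKE 'max_allowed_packet';

-- Raise it server-wide; new connections pick up the change.
-- It can also be set permanently in the server's option file.
SET GLOBAL max_allowed_packet = 64 * 1024 * 1024;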

Regardless of that, I have found that LOAD DATA INFILE is usually significantly faster than any INSERT syntax. It's so much faster, that you might even find it faster to write your tuples to a temporary CSV file and then use LOAD DATA INFILE on that CSV file.
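A minimal sketch of that approach, assuming a comma-separated file with one a,b,c row per line (the file path and CSV layout are illustrative, and LOCAL infile must be enabled on both client and server):

LOAD DATA LOCAL INFILE '/tmp/rows.csv'
INTO TABLE tbl
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(a, b, c);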

You might like my presentation comparing different bulk-loading solutions in MySQL: Load Data Fast!

Bill Karwin
  • My form offers three options for uploading a CSV file: 1. Upload new, or delete existing and upload new. 2. Upload new and merge without updating duplicates. 3. Upload new, merge, and update duplicates. This task is what led me down this path. So for #1 I am loading directly from the file. But for #2 and #3 I need to do comparisons to check `if not exist then insert` and finally `if not exist insert, but if exists then update`. Thus my questions related to `REPLACE` and `ON DUPLICATE` – rolinger Sep 12 '22 at 18:58
  • Bill, that's even [documented](https://dev.mysql.com/doc/refman/8.0/en/insert-optimization.html): "When loading a table from a text file, use LOAD DATA. This is usually 20 times faster than using INSERT statements." – Mihe Sep 12 '22 at 19:04
  • @rolinger, I encourage you to look at the [documentation for LOAD DATA](https://dev.mysql.com/doc/refman/8.0/en/load-data.html). It has an optional keyword `REPLACE` that might do what you want. – Bill Karwin Sep 12 '22 at 19:45
  • 1
    @BillKarwin - thanks. I think all three of my needs can be satisfied with this. #1 = delete all records then `LOAD DATA LOCAL INFILE $file ....`, #2 = merge only new records: `LOAD DATA LOCAL INFILE $file IGNORE...` and #3 = merge new records and update existing records: `LOAD DATA LOCAL INFILE $file REPLACE...` – rolinger Sep 12 '22 at 20:49

#1 (single-row inserts) -- Slow. A variant is INSERT IGNORE -- beware: it burns AUTO_INCREMENT ids.

#2 (batch insert) -- Faster than #1 by a factor of 10. But do the inserts in batches of no more than 1000 rows. (Beyond that you are into "diminishing returns" and may conflict with other activity.)

#3 REPLACE -- Bad. It is essentially a DELETE plus an INSERT. Once IODKU (INSERT ... ON DUPLICATE KEY UPDATE) was added to MySQL, I don't think there is any use for REPLACE. All the old AUTO_INCREMENT ids will be tossed and new ones created.

#4 IODKU (upsert) -- [For when you would otherwise need to test for existence before inserting.] It can be batched, but not the way you presented it. (There is no need to repeat the b and c values.)

INSERT INTO tbl (a, b, c)
     VALUES (1,2,3),(4,5,6),(7,8,9), ... ,(30001,30002,30003)
     ON DUPLICATE KEY UPDATE
         b = VALUES(b),
         c = VALUES(c);

Or, in MySQL 8.0.19 and later, give the inserted row an alias (append `AS new` after the VALUES list) and write the last 2 lines as:

         b = new.b,
         c = new.c;

IODKU also burns ids.

MySQL LOAD DATA INFILE with ON DUPLICATE KEY UPDATE discusses a 2-step process of LOAD + IODKU. Depending on how complex the "updates" are, 2+ steps may be your best answer.
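A minimal sketch of step 1 of that approach (the staging-table name and file path are assumptions); step 2 is then a single INSERT ... SELECT with IODKU, as sketched under #6 below:

-- Step 1: bulk-load the file into a staging table with the same columns
CREATE TEMPORARY TABLE staging LIKE tbl;

LOAD DATA LOCAL INFILE '/tmp/rows.csv'
INTO TABLE staging
FIELDS TERMINATED BY ','
(a, b, c);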

#5 LOAD DATA -- as Bill mentions, this is a good way if the data comes from a file. (I am dubious about its speed if you also have to write the data to a file first.) Be aware of the usefulness of @variables to make minor tweaks as you do the load. (E.g., STR_TO_DATE(...) to fix a DATE format.)
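To illustrate the @variable tweak, a purely hypothetical example (some_table, its created_on DATE column, and the date format are assumptions; the sample table in the question has no DATE column):

-- Read the raw text into a user variable, then rewrite it into a proper DATE
LOAD DATA LOCAL INFILE '/tmp/rows.csv'
INTO TABLE some_table
FIELDS TERMINATED BY ','
(a, b, @raw_date)
SET created_on = STR_TO_DATE(@raw_date, '%m/%d/%Y');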

#6 INSERT ... SELECT ...; -- If the data is already in some other table(s), you may as well combine the Insert and Select. This works for IODKU, too.
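For example, continuing the staging-table sketch above (names are assumptions), the batched upsert becomes one statement:

-- Step 2 of the LOAD + IODKU approach: upsert every staged row
INSERT INTO tbl (a, b, c)
SELECT a, b, c FROM staging
ON DUPLICATE KEY UPDATE
    b = VALUES(b),
    c = VALUES(c);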

As a side note, if you need to get the AUTO_INCREMENT ids of each batched row, I recommend some variant of the technique in Normalization. It is aimed at batch-normalizing id-name pairs that might already exist in the mapping table.

Rick James
  • Thanks for the detailed explanations. This helps. My original post came from trying to figure out the best way to import CSV files - dealing with speed, duplicates, updates and new inserts. Thus @BillKarwin's answer set me down the right path and I rewrote everything to use `LOAD DATA LOCAL INFILE`. But I am now using `REPLACE` for updates and `IGNORE` for duplicates. Based on what you wrote, though, how can I use IODKU with LOAD DATA? It appears `ON DUPLICATE` isn't supported for it. – rolinger Sep 14 '22 at 11:49
  • 1
    I found this as a work around, is there a better technique now? - this post is quite old. https://stackoverflow.com/questions/15271202/mysql-load-data-infile-with-on-duplicate-key-update – rolinger Sep 14 '22 at 11:52
  • 1
    @rolinger - Thanks; I folded a link to that Q+A into my Answer. Yes, I agree with it. -- LOAD is very fast; IODKU is very powerful. Between them (and may be some other steps -- but no loops -- you should have a fast _and_ powerful way to process the updates. – Rick James Sep 14 '22 at 15:03
  • I got all this working using the temporary-table method, then doing the IODKU `INSERT INTO mainTable SELECT * FROM tempTable` - but I have a question. In mainTable I have three indexes: `primary key=uid, unique key (listID,phone), unique key (listID,email)`. Each list in `mainTable` can only have one email or phone number, no repeats. Does `ON DUPLICATE KEY` require the primary key and both unique keys to all be unique, or does it apply only to the primary key or the first match (to determine a duplicate)? In my setup, I really don't even care about the primary key. – rolinger Sep 14 '22 at 20:53
  • 1
    @rolinger - Yes, so beware. Any or all of the 3 unique keys will trigger the "update" side of IODKU. Please provide `SHOW CREATE TABLE` to discuss this futher. (If the PK is an auto_increment that is not used in any JOINs, maybe you should get rid of it, and promote one of the UNIQUEs to be the PK.) – Rick James Sep 15 '22 at 00:25
  • Cool. Thanks for the clarity and your help again. – rolinger Sep 15 '22 at 12:43
  • The speed advantage of LOAD DATA over INSERT is greater if you have larger files. I agree that for a medium-sized dataset it might not be worthwhile to write it out to a file first, because that takes time too. But if the dataset is _very_ large, the time taken to write it to a file may be less than the difference in performance. Also it depends on how speedy your filesystem storage is. – Bill Karwin Sep 15 '22 at 16:36
  • @BillKarwin - Thanks for the insight. I did not realize that it might not be linear with table size. I have yet to encounter (or create) a benchmark that provides clear evidence of what you say. I worry that there are too many variables to make a clean benchmark (num columns, app language's write speed, SSD vs HDD, secondary indexes, data sorted by PK, etc, etc). I suspect that whether the data is already sorted by PK (or using provided a_i) is a big variable. Furthermore, I don't understand why mysqldump seems to add the PK _after_ loading the data. – Rick James Sep 15 '22 at 17:53
  • Also number of indexes in the table, and performance of CPU and I/O on the given server. It's hard to make a benchmark that is informative, when there are so many factors that can affect the result. – Bill Karwin Sep 15 '22 at 17:57