10

I am trying to insert a lot of users into a MySQL database with two tables:

The first table contains the user data. An example INSERT looks like this (id is the primary key, mail is a unique key):

INSERT INTO users (id, mail, name)  
VALUES (NULL, "foo@bar.tld", "John Smith") 
ON DUPLICATE KEY UPDATE name = VALUES(name)

The second table contains the group the user belongs to. It only stores two foreign keys users_id and groups_id. An example query looks like this:

INSERT INTO users_groups (users_id, groups_id)
VALUES (LAST_INSERT_ID(), 1)

This setup works perfectly fine for small data sets. When I import large amounts of data (>1M rows) the INSERTs get slow. Obviously, it would be much better to do a batch insert:

INSERT INTO users (id, mail, name)  
VALUES (NULL, "foo@bar.tld", "John Smith"), (NULL, "baz@qux.tld", "Anna Smith") 
ON DUPLICATE KEY UPDATE name = VALUES(name)

and:

INSERT INTO users_groups (users_id, groups_id)
VALUES (LAST_INSERT_ID(), 1), (LAST_INSERT_ID(), 4)

The problem, of course, is that LAST_INSERT_ID() only returns one id of a batch INSERT (the first one).
So what I would need is a "nested" batch INSERT, which as far as I can tell does not exist in MySQL.
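
To make that concrete (same tables and values as above):

INSERT INTO users (id, mail, name)
VALUES (NULL, "foo@bar.tld", "John Smith"), (NULL, "baz@qux.tld", "Anna Smith");

SELECT LAST_INSERT_ID();
-- returns only the id generated for the first row ("foo@bar.tld"),
-- not the ids of the other rows in the batch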

What can I do to make my INSERTs faster?

Horen
  • If you have an index on the mail column, for example, you could just go with an `insert select` statement to fill up the group table. In the select statement, you would then use a CASE WHEN expression, assigning the group depending on the email (a rough sketch of this follows the comments). But if harvey's solution is confirmed then it's better. – Sebas Dec 04 '14 at 23:47
  • Can you give some context? Programmatically inserting such a large number of rows is an unusual use case. If this is a one-time operation, a long execution time may be a non-issue. If this is a regular synchronisation, then perhaps you should look into replication, or even sharing the user table across your applications. – RandomSeed Dec 10 '14 at 15:51
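
A rough sketch of that INSERT ... SELECT idea (the domain-to-group mapping is invented purely for illustration):

INSERT INTO users_groups (users_id, groups_id)
SELECT id,
       CASE WHEN mail LIKE "%@bar.tld" THEN 1   -- hypothetical mapping
            WHEN mail LIKE "%@qux.tld" THEN 4   -- hypothetical mapping
       END
FROM users
WHERE mail IN ("foo@bar.tld", "baz@qux.tld")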

4 Answers

6

Bulk inserts by default generate sequential auto-increment ids, so with this knowledge you can do your inserts like this:

INSERT INTO users (id, mail, name)  
VALUES  (NULL, "foo@bar.tld", "John Smith"), 
        (NULL, "baz@qux.tld", "Anna Smith"),
        (...)  # repeat n-times
;

SET @LASTID=LAST_INSERT_ID()
;

INSERT INTO users_groups (users_id, groups_id)
VALUES    (@LASTID - n  , 1), # Note n in descending sequence
          (@LASTID - n-1, 1),
          ...
          (@LASTID - 1  , 1), 
          (@LASTID - 0  , 4)
;

For more information on bulk inserts and auto increment, have a look at http://dev.mysql.com/doc/refman/5.1/en/innodb-auto-increment-handling.html

Importantly, make sure that innodb_autoinc_lock_mode=1

SHOW GLOBAL VARIABLES LIKE 'innodb_autoinc_lock_mode';

Otherwise, consider wrapping your inserts in LOCK TABLES:

LOCK TABLES tbl_name WRITE
... sqls ...
UNLOCK TABLES
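
For example, a minimal sketch of that wrapping applied to the two tables here (the actual INSERT bodies are the ones shown above):

LOCK TABLES users WRITE, users_groups WRITE;

INSERT INTO users (id, mail, name)
VALUES (...);                     # the batch insert from above

SET @LASTID = LAST_INSERT_ID();

INSERT INTO users_groups (users_id, groups_id)
VALUES (...);                     # the offset-based values from above

UNLOCK TABLES;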
harvey
  • Are you sure the 1st value line corresponds to the first generated id though? I'd like to see a link about that – Sebas Dec 04 '14 at 23:45
  • @Sebas; There's no way to get the insert id of the first value, which is why you get the last id and subtract backwards. – harvey Dec 04 '14 at 23:48
  • My question is: are you sure `@LASTID - n` is the ID generated for `foo@bar.tld` ? – Sebas Dec 04 '14 at 23:49
  • As long as the number of rows matches, yes; a bulk insert is a single atomic insert. Is your concern that the auto increment numbers are not sequential? – harvey Dec 04 '14 at 23:53
  • In the way you propose your query, it seems you assume `@LASTID - n` goes for "foo@bar.tld", `@LASTID - n-1` goes for "baz@qux.tld" and so on (this is how you apparently define the value for the `groups_id` column). I am saying this might be wrong, even though you get indeed a sequential range of ids at the end. – Sebas Dec 04 '14 at 23:57
  • Of course, again it should be clear that I am assuming that there is an equal number of inserts and that they are in order. Bulk inserts lock AUTO-INC, so you are guaranteed sequential ids, and inserts happen one at a time in order; you can find more details at http://dev.mysql.com/doc/refman/5.1/en/innodb-auto-increment-handling.html - if your concern is that there may be multiple group rows per user, they simply have to make sure they use the same LASTID-n value. – harvey Dec 05 '14 at 00:07
  • ok, the article seems to say the order of insertion is the order in which the `VALUES` expressions appear in the statement (i.e. what you see is what you get). Then your solution would work. Upvoted. – Sebas Dec 05 '14 at 00:29
  • 3
    The `INSERT` query from the question has an `ON DUPLICATE KEY UPDATE` clause. If `id` is the only `UNIQUE` index of the table, that clause is harmless and not needed. But if there is another `UNIQUE` index on the table (on column `mail` f.e.) and a duplicate key happens during the `INSERT` then all this `LAST_INSERT_ID()` thing is wrong because in this case the `id` is incremented but not used (the existing row is updated). – axiac Dec 05 '14 at 12:12
  • 1
    Yes, that is exactly the case. So your approach will not work for me unfortunately – Horen Dec 05 '14 at 14:58
  • You can structure the insert to `INSERT INTO .... ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id), name=...` - this allows LAST_INSERT_ID to be set in this scenario - see http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html for more information. – harvey Dec 07 '14 at 22:03
  • Scratch that last comment - it won't work, as your ids will no longer be sequential. – harvey Dec 07 '14 at 22:07
  • Nice, but I think you should either start with @LASTID - (n-1) or make n a 0-based list. – maugch May 21 '16 at 16:49
1

If you're putting millions of known rows into a table all at once, consider using LOAD DATA INFILE since it's intended for speed in just that type of scenario, as evidenced by this quote from the docs:

The LOAD DATA INFILE statement reads rows from a text file into a table at a very high speed.

And at Speed of INSERT Statements:

When loading a table from a text file, use LOAD DATA INFILE. This is usually 20 times faster than using INSERT statements.

This is assuming that your source data is coming from, or could be provided as, a text file. If you have the group ID in the file as well, you might do something like this:

CREATE TEMPORARY TABLE load_users_groups (
  mail VARCHAR(60),
  name VARCHAR(60),
  groupid INT,
  PRIMARY KEY (mail, name)
);

LOAD DATA INFILE '/path/to/file.csv'
INTO TABLE load_users_groups
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';  -- use whatever optional syntax is required to parse your file

INSERT INTO users (mail, name)
SELECT mail, name FROM load_users_groups
ON DUPLICATE KEY UPDATE name = VALUES(name);

INSERT INTO users_groups (users_id, groups_id)
SELECT users.id, load_users_groups.groupid
FROM users JOIN load_users_groups USING (mail, name);

DROP TEMPORARY TABLE load_users_groups;

Whether this approach ends up being faster than your current approach depends on whether you save more time using LOAD DATA INFILE than you spend performing two additional INSERT ... SELECT statements to move the data into your desired tables. You may want to tweak keys on the temporary table; I can't benchmark it for you based solely on the contents of your question. I'd be interested to know how it works out, though.

The documentation also has a decent number of tips for Bulk Data Loading for InnoDB Tables and Bulk Data Loading for MyISAM Tables. I won't go through them in detail, not least because you haven't given us any DDL or server info, but you may find it helpful to read through one or the other on your own time.
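
For what it's worth, the InnoDB tips largely boil down to toggling a few session settings around the load, roughly like this (whether each one is safe to disable depends on how much you trust the incoming data, e.g. unique_checks matters if you rely on ON DUPLICATE KEY UPDATE):

SET autocommit = 0;          -- commit once at the end instead of after every statement
SET unique_checks = 0;       -- only if the source data is known to be duplicate-free
SET foreign_key_checks = 0;  -- only if the foreign key values are known to be valid

-- ... LOAD DATA INFILE / INSERT ... SELECT statements here ...

COMMIT;
SET unique_checks = 1;
SET foreign_key_checks = 1;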

Air
0

Had to deal with a similar issue.

MySQL doesn't really offer much of a way to reliably reserve large batches of table ids for this purpose. I spent a good half a day researching, to no avail. There are some hacks floating around, but nothing I'd bank my data on.

I just did the users table with one-by-one inserts (better slow than screwy) and returned the id of each new row to my ORM. That gave me a row id to work with, so I was then able to put it, together with the data to be imported, into a JSON structure that keeps them both together. That made it much easier for me to bulk insert and keep the data matched up.
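
In SQL terms the pattern was roughly this (a sketch; collecting the id/group pairs actually happened in application code, and the ids shown are hypothetical):

-- one user at a time
INSERT INTO users (id, mail, name)
VALUES (NULL, "foo@bar.tld", "John Smith");

SELECT LAST_INSERT_ID();  -- store this id next to the user's group data

-- later, once all user ids are collected, one bulk insert for the groups
INSERT INTO users_groups (users_id, groups_id)
VALUES (101, 1), (102, 4), (103, 1);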

Best.

Thomas Skubicki