4

When performing INSERT ... ON DUPLICATE KEY UPDATE on an InnoDB table in MySQL, we are often told to ignore the potential gaps in AUTO_INCREMENT columns. What if such gaps are very likely and cannot be ignored?

As an example, suppose there is a table rating that stores users' ratings of items. The table schema is something like:

CREATE TABLE rating (
  id INT AUTO_INCREMENT PRIMARY KEY,
  user_id INT NOT NULL,
  item_id INT NOT NULL,
  rating INT NOT NULL,
  UNIQUE KEY tuple (user_id, item_id),
  FOREIGN KEY (user_id) REFERENCES user(id),
  FOREIGN KEY (item_id) REFERENCES item(id)
);

There may be many users and many items, and users may frequently change the ratings of items they have already rated. Every time a rating is changed via INSERT ... ON DUPLICATE KEY UPDATE, a gap is created in the AUTO_INCREMENT sequence. The alternatives are to query twice (do a SELECT first, then INSERT or UPDATE), which hurts performance, or to check the affected-row count, which does not work for multi-row INSERTs.
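For concreteness, the gap-producing statement looks something like this (the values are hypothetical):

```sql
-- Under the default innodb_autoinc_lock_mode, every execution of this
-- statement reserves a new AUTO_INCREMENT value, even when the
-- (user_id, item_id) pair already exists and only the UPDATE fires,
-- so each changed rating leaves a gap in id.
INSERT INTO rating (user_id, item_id, rating)
VALUES (42, 7, 5)
ON DUPLICATE KEY UPDATE rating = VALUES(rating);
```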

For a system where 100K users have each rated 10 items and change half of their ratings every day, the AUTO_INCREMENT id will be exhausted within two years. What should we do to prevent this in practice?

peter
  • Use a `BIGINT` instead of an `INT`. It should last you centuries. –  Mar 28 '15 at 08:11
  • Gaps are OK! Just use a bigger id field. Don't try to reuse gaps; that is a bad idea. – Alexander R. Mar 28 '15 at 08:16
  • @HoboSapiens But this also harms performance... I don't know, maybe I was being too paranoid? – peter Mar 28 '15 at 08:18
  • It won't harm your performance. Also, if you are going to exhaust the regular 4-byte integer in 2 years, then you also have no other choice but to use bigint. Alternatives are using guids and various silly composite keys which will definitely hurt your performance. – N.B. Mar 28 '15 at 09:21
  • This might sound like a stupid question, but why would you want to create a different entry when a user changes a rating? Is it historical? Why not just update the user's rating for an item? –  Mar 28 '15 at 11:02
  • @IkoTikashi This is not only for changing a rating; it is also likely that the user is rating an item that he/she has never rated before. The problem is that before querying the database, we do not know whether such a `(user_id, item_id)` pair already exists in the table. – peter Mar 28 '15 at 11:19
  • I see, although I'd find it "easier" to first check if the user already has a rating for that item and INSERT if not, or UPDATE if he already has. –  Mar 28 '15 at 11:22
  • @N.B. Have a look at [this](http://stackoverflow.com/a/1460544/1027993). It looks like if I always use the composite keys as a combination, the performance would not hurt. – peter Mar 28 '15 at 11:30
  • The answer you linked still uses auto_increment as the first part of the composite key; the important part is still handled by MySQL and not by you, so there are no problems with such keys. However, what exact performance problem would you have if you used a `bigint` as opposed to an `int`? And why would you use composite keys at all? You are optimizing prematurely. No matter what type of index you come up with, it will **never** be better than a `bigint` primary key, both in terms of performance and ease of implementation. – N.B. Mar 28 '15 at 11:35

2 Answers

3

Full answer.

Gaps are OK! Just use a bigger id field, for example BIGINT. Don't try to reuse gaps; that is a bad idea. Don't worry about performance or optimization in this case; it is a waste of time.
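A minimal sketch of that fix, against the schema from the question:

```sql
-- BIGINT UNSIGNED tops out at 18,446,744,073,709,551,615, so even
-- burning millions of ids per day through gaps will not exhaust it.
ALTER TABLE rating
  MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
```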

Another solution is to make a composite key the primary key. In your case, you can remove the id field and use the pair (user_id, item_id) as the primary key.
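Sketched against the schema from the question, the table would become:

```sql
CREATE TABLE rating (
  user_id INT NOT NULL,
  item_id INT NOT NULL,
  rating INT NOT NULL,
  PRIMARY KEY (user_id, item_id),  -- natural composite key, no gaps possible
  FOREIGN KEY (user_id) REFERENCES user(id),
  FOREIGN KEY (item_id) REFERENCES item(id)
);
```

INSERT ... ON DUPLICATE KEY UPDATE continues to work unchanged against the composite primary key.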

In the case of a rating table, the most frequent operations are deletes by user_id and inserts, so you do not really need this id primary key for functionality. But a table should always have a primary key.

The only drawback of this method is that when you want to delete just one row from the table, you will need a query like:

DELETE FROM rating WHERE user_id = 123 AND item_id=1234

instead of old

DELETE FROM rating WHERE id = 123

But changing one line of code in your application is not hard. Furthermore, in most cases people do not need such functionality anyway.

Alexander R.
  • Thank you, your suggestion regarding composite primary key looks like a good solution. I tried to avoid those gaps using PostgreSQL 9.5, but the same phenomenon happens with PostgreSQL. – W.M. Aug 07 '16 at 11:14
2

We work with large tables, some holding hundreds of millions of records, and we repeatedly use INSERT IGNORE and INSERT ... ON DUPLICATE KEY UPDATE. Making the column an unsigned BIGINT will avoid the id exhaustion issue.

But I would suggest you think about a long-term solution as well, given some known facts:

  • SELECT followed by INSERT/UPDATE is quite often faster than INSERT ... ON DUPLICATE KEY UPDATE, though this depends on your data size and other factors
  • If you have two unique keys (or one primary key and one unique key), your query may not always be predictable. It can cause replication errors if you use statement-based replication
  • The id is not the only issue with large tables. Once a table has more than roughly 300M records, performance degrades drastically. You will need to think about partitioning/clustering/sharding your database/tables fairly soon
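As a hypothetical sketch of the partitioning route: MySQL partitioned tables do not support foreign keys, and the partitioning column must be part of every unique key, which a composite (user_id, item_id) primary key satisfies.

```sql
CREATE TABLE rating_partitioned (
  user_id INT NOT NULL,
  item_id INT NOT NULL,
  rating  INT NOT NULL,
  PRIMARY KEY (user_id, item_id)
)
PARTITION BY HASH (user_id)   -- spreads users across partitions
PARTITIONS 16;                -- partition count is an arbitrary example
```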

Personally, I would suggest not using INSERT ... ON DUPLICATE KEY UPDATE. Read extensively on its usage and performance impact if you are planning a highly scalable service.
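If you take the SELECT-then-write route instead, a sketch of it (with hypothetical values) looks like this; the SELECT ... FOR UPDATE inside a transaction guards against a concurrent insert of the same pair:

```sql
START TRANSACTION;
SELECT rating FROM rating
 WHERE user_id = 42 AND item_id = 7
 FOR UPDATE;
-- if the SELECT returned a row:
UPDATE rating SET rating = 5
 WHERE user_id = 42 AND item_id = 7;
-- otherwise:
-- INSERT INTO rating (user_id, item_id, rating) VALUES (42, 7, 5);
COMMIT;
```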

georgecj11
  • @peter, as I said, read up on it. Suppose columns c1 and c2 are each supposed to be unique, and the table already contains two rows. Now in the INSERT statement you pass a row that conflicts with both existing rows. Which row should be updated, the first or the second? MySQL just updates the first row it finds, which might not even be the same on the master and the slave(s). There are a lot of blogs and Q&As on the internet; read more on the same – georgecj11 Mar 29 '15 at 06:40
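The ambiguity described in the comment above can be sketched with a throwaway table where a single inserted row conflicts with two different existing rows:

```sql
CREATE TABLE t (
  c1 INT NOT NULL UNIQUE,
  c2 INT NOT NULL UNIQUE,
  v  INT NOT NULL
);
INSERT INTO t VALUES (1, 2, 0), (3, 4, 0);

-- (1, 4) conflicts with the first row on c1 and the second row on c2.
-- MySQL updates only the first conflicting row it encounters, and with
-- statement-based replication that choice can differ between master
-- and slave.
INSERT INTO t (c1, c2, v) VALUES (1, 4, 9)
ON DUPLICATE KEY UPDATE v = VALUES(v);
```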