
I am not a DB expert, so I hope someone can give me a pointer. Is there an advantage to using a vanilla approach vs. an upsert approach? I find the vanilla approach clearer, with no obvious side effects.

This table is a summary table created from time-consuming reads and calculations (3-4 seconds on average to complete). It is disposable (it can be truncated at any time) and receives a lot of updates and reads, with minimal inserts.
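
A minimal sketch of the kind of schema the snippets below assume (the column types here are placeholders, and `$q` is the PDO connection used later; the upsert approach relies on a unique index on groupId):

    // Hypothetical schema sketch: column types are assumptions, but the
    // upsert below needs a unique index on groupId, and id is the
    // auto-increment primary key discussed in this question.
    $q->exec('
        CREATE TABLE IF NOT EXISTS data_snapshots (
            id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
            groupId INT UNSIGNED NOT NULL,
            field1  VARCHAR(255) NULL,
            field2  VARCHAR(255) NULL,
            field3  VARCHAR(255) NULL,
            PRIMARY KEY (id),
            UNIQUE KEY uq_groupId (groupId)
        ) ENGINE=InnoDB
    ');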

Initially each read access would update and then read the data. To improve performance, I added a 5-minute delay between updates (a sketch of the check follows the list below):

  • after an update occurs for a group, for the next 5 minutes users in that group will only read the data (saving 3-4 seconds)
  • the first access after the 5 minutes has elapsed will trigger a full update for the group data
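
A rough sketch of that check, assuming a hypothetical `updatedAt` timestamp column on the snapshot row and hypothetical helpers for the expensive recalculation:

    // Read the snapshot only if it was refreshed within the last 5 minutes.
    $sql = 'SELECT field1, field2, field3
            FROM data_snapshots
            WHERE groupId = :groupId
              AND updatedAt > NOW() - INTERVAL 5 MINUTE';
    $stmt = $q->prepare($sql);
    $stmt->execute([':groupId' => $groupId]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {
        // Stale or missing: run the 3-4 second calculation, then write the
        // snapshot using one of the approaches below (hypothetical helpers).
        $row = recalculateSnapshot($groupId);
        saveSnapshot($q, $groupId, $row);
    }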

Initially I implemented an upsert algorithm. However, a side effect of the upsert's insert is that the auto-increment PK counter is advanced with every execution. This resulted in a jump of 100,000 for that key in a few hours. My concern is not the gaps, but hitting the max value for an INT. I can manage this by either removing the auto-increment field or issuing a TRUNCATE command every night, and I also saw a few other creative solutions to prevent the auto-increment growth.
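
For illustration, here is a sketch of how that jump shows up. (The exact behaviour depends on innodb_autoinc_lock_mode; with the default setting, an upsert that takes the update path can still reserve an auto-increment value.)

    // Each upsert that ends up updating can still burn an auto-increment value,
    // so a later genuine insert lands on a higher id than expected.
    $upsert = $q->prepare(
        'INSERT INTO data_snapshots (groupId, field1, field2, field3)
         VALUES (:groupId, :field1, :field2, :field3)
         ON DUPLICATE KEY UPDATE
         field1 = VALUES(field1), field2 = VALUES(field2), field3 = VALUES(field3)'
    );

    // Suppose groupId 1 already exists: run the upsert a few times (updates only).
    for ($i = 0; $i < 5; $i++) {
        $upsert->execute([':groupId' => 1, ':field1' => "a$i", ':field2' => "b$i", ':field3' => "c$i"]);
    }

    // Now insert a brand-new group; its id may have jumped by roughly the
    // number of upserts above, even though only one new row was added.
    $upsert->execute([':groupId' => 2, ':field1' => 'a', ':field2' => 'b', ':field3' => 'c']);
    echo $q->lastInsertId();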

However, I prefer not to manage a side effect, so I decided to try to resolve this with a different approach, especially since this is a read- and update-centric table.

Revised approach

    $lastInsertId = 0;
    $q->beginTransaction();

    // Try to update the existing snapshot row for this group.
    // id = LAST_INSERT_ID(id) makes lastInsertId() report the id of the row
    // that was updated, so we can tell whether the UPDATE matched anything.
    $sql = 'UPDATE data_snapshots
            SET field1 = :field1, field2 = :field2, field3 = :field3, id = LAST_INSERT_ID(id)
            WHERE groupId = :groupId';
    $args = [':groupId' => $groupId, ':field1' => $field1, ':field2' => $field2, ':field3' => $field3];
    $stmt = $q->prepare($sql);
    $result = $stmt->execute($args);
    $lastInsertId = $q->lastInsertId();

    // No existing row for this group: insert one instead.
    if ($lastInsertId == 0) {
        $sql = 'INSERT INTO data_snapshots (groupId, field1, field2, field3)
                VALUES (:groupId, :field1, :field2, :field3)';
        $stmt = $q->prepare($sql);
        $result = $stmt->execute($args);
        $lastInsertId = $q->lastInsertId();
    }

    if ($result == true && $lastInsertId > 0) {
        $q->commit();
        $modified = true;
    } else {
        $q->rollBack();
    }

Upsert approach

    $sql = 'INSERT INTO data_snapshots
            (groupId, field1, field2, field3)
            VALUES
            (:groupId, :field1, :field2, :field3)
            ON DUPLICATE KEY UPDATE
            groupId = :groupId_dup, field1 = :field1_dup, field2 = :field2_dup, field3 = :field3_dup';

    // The same values have to be bound twice: once for the INSERT and once
    // for the ON DUPLICATE KEY UPDATE clause.
    $args = [':groupId' => $groupId, ':field1' => $field1, ':field2' => $field2, ':field3' => $field3,
             ':groupId_dup' => $groupId, ':field1_dup' => $field1, ':field2_dup' => $field2, ':field3_dup' => $field3];

    $stmt = $q->prepare($sql);
    $result = $stmt->execute($args);

1 Answer


INSERT...ON DUPLICATE KEY UPDATE does not increment the auto-increment counter.

Your "vanilla approach" has a race condition problem. You try to UPDATE but if it affects no rows, then you INSERT. But what if some other client inserts the row in the brief amount of time between your UPDATE and your INSERT?

You might think this is unlikely to the point where you don't have to worry about it, but we frequently develop code that runs many transactions per second, and from many concurrent clients. There's an old cautionary saying, "one in a million is next Tuesday," meaning even though you assume the race condition is very unlikely to happen, it's sure to happen sooner than you think, given a large number of samples.


By the way, you can eliminate all the *_dup parameters from your prepared statement. In your ON DUPLICATE KEY clause, use VALUES() like this:

    INSERT INTO data_snapshots (groupId, field1, field2, field3)
    VALUES (:groupId, :field1, :field2, :field3)
    ON DUPLICATE KEY UPDATE
    groupId = VALUES(groupId), field1 = VALUES(field1), field2 = VALUES(field2), field3 = VALUES(field3);

This syntax means if the ON DUPLICATE KEY clause is executed, the values to update will be the same values that you provided in the parameters in the VALUES() clause above. If this is the typical case, then you don't have to pass each value twice.
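
In PHP terms, the prepared statement would then need only one set of parameters, something like this sketch:

    // With VALUES() in the ON DUPLICATE KEY clause, only one set of named
    // parameters has to be bound.
    $sql = 'INSERT INTO data_snapshots (groupId, field1, field2, field3)
            VALUES (:groupId, :field1, :field2, :field3)
            ON DUPLICATE KEY UPDATE
            groupId = VALUES(groupId), field1 = VALUES(field1),
            field2 = VALUES(field2), field3 = VALUES(field3)';

    $args = [':groupId' => $groupId, ':field1' => $field1,
             ':field2' => $field2, ':field3' => $field3];

    $stmt = $q->prepare($sql);
    $result = $stmt->execute($args);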

Bill Karwin
  • Hello Bill, thank you for your response. In MySQL, INSERT...ON DUPLICATE KEY UPDATE does increment the auto-increment counter; it just does not use the new value, so the gap is only seen when a new row is inserted. This is documented in [other posts](https://stackoverflow.com/questions/548541/insert-ignore-vs-insert-on-duplicate-key-update/548570#548570), and I have already experienced it as a side effect. – Ninety nine Mar 15 '21 at 20:19
  • Your comment about creating a race condition is a valid point. It would be valuable to know (in MySQL) whether using the upsert syntax prevents a race condition or deadlock any better than the transaction I suggested above. In that case I would consider using upsert, dropping the auto-increment id, and using the groupId as the index. Also, this table is primarily updates, with minimal inserts. – Ninety nine Mar 15 '21 at 20:20
  • The statement acquires a lock on the row, which prevents concurrent updates: "INSERT ... ON DUPLICATE KEY UPDATE differs from a simple INSERT in that an exclusive lock rather than a shared lock is placed on the row to be updated when a duplicate-key error occurs. An exclusive index-record lock is taken for a duplicate primary key value. An exclusive next-key lock is taken for a duplicate unique key value." (https://dev.mysql.com/doc/refman/8.0/en/innodb-locks-set.html) – Bill Karwin Mar 15 '21 at 20:43
  • "INSERT ... ON DUPLICATE KEY UPDATE" superior locking. incrementing side effect. My last hurdle then is that for my scenario where update is 90% vs insert it appears wasteful to generate this extra insert centrc activity. – Ninety nine Mar 15 '21 at 21:38
  • What incrementing side effect? I tested it, and I don't see any effect on the auto-increment when I try to insert a duplicate row. – Bill Karwin Mar 15 '21 at 22:23
  • The "other posts" link above has a demonstrable example of the auto increment side effect. The gap auto-incremented value will only appear on an insert after several updates and then an insert. This side effect is real. My requirement is such that we insert 1000 rows and then those 1000 rows see only rapid updates for the next few months, I will go with an updatecentric solution, not the INSERT ON DUPLICATE approach, which, in my case, would generate a key violations for each update. Also the INSERT ON DUPLICATE algorithm is still prone to deadlocks, so there is no perfect solution. – Ninety nine Mar 16 '21 at 16:36