
I'm seeing 'Deadlock found when trying to get lock' when issuing a COMMIT at virtually the same time that a different node in our WAN Galera cluster has connectivity issues. In this particular situation I'm inserting data into a single table with an auto_increment PK, no FKs, and no other unique constraints. According to the logs, the node issuing the COMMIT has not yet recognized that the other node is having problems: the cluster size is still unchanged after the deadlock exception is thrown.

I initially assumed the error was caused by the auto_increment_increment and auto_increment_offset values changing when the cluster size changed, leading to PK conflicts. To simplify matters I configured Galera not to manage those values at all and manually set appropriate values across the cluster, but that didn't solve the problem.

Based on the Galera documentation, it sounds like the committing node verifies that the transaction doesn't conflict with anything on the other nodes in the cluster. Given my auto_increment_* configuration, I know the auto_increment ids shouldn't conflict. So at this point I'm assuming that the committing node attempts to check the status of the transaction with all nodes, including the node that recently and very temporarily (< 1 min) went offline, and rejects the transaction because it can't get a response from the node that is having issues.
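To make the auto_increment reasoning above concrete, here is a minimal sketch of why manually assigned values should not collide. It assumes a hypothetical three-node cluster where every node uses auto_increment_increment = 3 and node k uses auto_increment_offset = k; these numbers are illustrative only and are not taken from the actual configuration. The helper names are made up for the example.

```python
from itertools import islice

def auto_increment_ids(offset, increment):
    """Yield the ids one node hands out: offset, offset + increment, ..."""
    value = offset
    while True:
        yield value
        value += increment

def first_ids(offset, increment, count):
    """Return the first `count` ids for a node as a list."""
    return list(islice(auto_increment_ids(offset, increment), count))

def ids_are_disjoint(node_count, count):
    """Check that no id is generated by more than one node."""
    all_ids = []
    for node in range(1, node_count + 1):
        # Assumption: node k uses offset k and increment node_count.
        all_ids.extend(first_ids(node, node_count, count))
    return len(all_ids) == len(set(all_ids))

NODE_COUNT = 3  # hypothetical cluster size

print(first_ids(2, NODE_COUNT, 4))         # [2, 5, 8, 11]
print(ids_are_disjoint(NODE_COUNT, 1000))  # True
```

Each node only ever produces ids in its own residue class modulo the increment, so with fixed values the inserts themselves should never produce a duplicate PK.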

I'm relatively new to Galera (8 months), and I was hoping a seasoned Galera veteran might have some advice on the best way to handle this situation. I'm aware of the "retry the transaction" approach, but that strikes me as a bit of a hack. I'm hoping there is an alternative solution, or at least some additional information about the underlying cause of this particular issue.
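For reference, the "retry the transaction" approach mentioned above could look roughly like the following. This is only a sketch under stated assumptions: `run_with_retry` and `is_deadlock` are hypothetical helper names, and the check assumes the database driver raises an exception whose first argument is the MySQL error code (1213 for 'Deadlock found when trying to get lock'), which may not hold for every driver.

```python
import random
import time

ER_LOCK_DEADLOCK = 1213  # MySQL error code for "Deadlock found when trying to get lock"

def is_deadlock(exc):
    """Assumption: the driver stores the MySQL error code in exc.args[0]."""
    args = getattr(exc, "args", ())
    return bool(args) and args[0] == ER_LOCK_DEADLOCK

def run_with_retry(transaction, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call `transaction()` and retry it only when it fails with a deadlock.

    `transaction` must perform the whole unit of work (BEGIN ... COMMIT),
    because a failed COMMIT means the transaction was rolled back.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except Exception as exc:
            # Re-raise anything that is not a deadlock, or the final failure.
            if not is_deadlock(exc) or attempt == max_attempts:
                raise
            # Exponential backoff with jitter before the next attempt.
            sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))
```

The callable passed in would open the transaction, run the inserts, and commit, so that every retry replays the complete unit of work rather than only the COMMIT.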

Thanks

Rick James
human79
