
Situation:

We have a master-master replication using GTIDs on Percona MySQL 5.6.32-78.1. There are about 10 databases on each server and we've set slave_parallel_workers=5. One server handles the frontend, the other the backend. Two or three times a week, replication on the backend server dies with this error:

2016-10-25 10:00:01 165238 [Warning] Slave SQL: Worker 4 failed executing transaction '0e7b97a8-a689-11e5-8b79-901b0e8b0f53:22506262' at master log mysql-bin.011888, end_log_pos 9306420; Could not execute Update_rows event on table shop.sessions; Deadlock found when trying to get lock; try restarting transaction, Error_code: 1213; handler error HA_ERR_LOCK_DEADLOCK; the event's master log mysql-bin.011888, end_log_pos 9306420, Error_code: 1213
2016-10-25 10:00:01 165238 [ERROR] Slave SQL: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
2016-10-25 10:00:01 165238 [Note] Error reading relay log event: slave SQL thread was killed

What could be the reason? There are no cross-database DML statements, and I thought that with MTS only one worker thread is used per database (the benefit of MTS being parallel replication across several databases). So why does replication break with a deadlock?
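
If replication just needs to be brought back up after the workers stop, a plain restart of the slave threads is usually enough; the error message itself says a restart should restore consistency. A minimal recovery sketch, assuming nothing beyond the setup described above:

-- Confirm the MTS setting (in MySQL 5.6, transactions are scheduled per database)
SHOW VARIABLES LIKE 'slave_parallel_workers';

-- Restart the stopped coordinator and worker threads; with GTIDs and InnoDB
-- tables the failed transaction is simply retried from the relay log
STOP SLAVE;
START SLAVE;

-- Verify that Slave_SQL_Running is back to Yes
SHOW SLAVE STATUS\G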

EDIT 2016-10-28:

Schema of table looks like

CREATE TABLE `sessions` (
  `id` int(11) NOT NULL,
  `session_id` char(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `crypt_iv` blob NOT NULL,
  `data` mediumblob NOT NULL,
  `user_id` int(11) NOT NULL,
  `last_refresh` datetime NOT NULL,
  `timeout` datetime NOT NULL,
  `closed` tinyint(4) NOT NULL,
  `inserted` datetime NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `sessions`
  ADD PRIMARY KEY (`id`),
  ADD UNIQUE KEY `session_id` (`session_id`),
  ADD KEY `user_id` (`user_id`),
  ADD KEY `timeout` (`timeout`);
ALTER TABLE `sessions` MODIFY `id` int(11) NOT NULL AUTO_INCREMENT;

So far this error has only happened on the backend server, never on the frontend server. At the moment I cannot paste the exact statement, as the binary logs have been purged, but the only statement inside this GTID transaction is a row-based UPDATE on the table.
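
Since the binary logs are already purged, one way to capture more detail the next time this happens is to look at InnoDB's last-deadlock report right after the failure and to have every future deadlock written to the error log. A minimal sketch, assuming an otherwise default MySQL/Percona Server 5.6 configuration:

-- The LATEST DETECTED DEADLOCK section lists both transactions and the locks they waited for
SHOW ENGINE INNODB STATUS\G

-- Write every deadlock to the error log, not just the most recent one (available since 5.6)
SET GLOBAL innodb_print_all_deadlocks = ON;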

rabudde
  • Are you using "Row Based Replication"? What does the EVENT do? Let's see `SHOW CREATE TABLE` for the table(s) in question. – Rick James Oct 27 '16 at 22:27
  • Are the Events declared on both machines? Should they be? (Triggers generally should not be.) – Rick James Oct 27 '16 at 22:28

1 Answer


I guess all sessions are created on the frontend server. Is there maybe a session cleanup job on the backend server? Then you would have writes to the table from both machines. For a write-heavy table such as sessions, you should write to it from only one machine to avoid this kind of deadlock.

Actually, you should always do all writes on one machine only, except in failover situations when one master goes down.
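
A minimal sketch of enforcing this with the standard read_only flag (an assumption about your setup, not something your configuration requires): the replication SQL thread is not affected by read_only, but note that accounts with the SUPER privilege can still write, so application users must not have SUPER.

-- On the passive (backend) master: reject writes from normal client connections;
-- replication threads ignore read_only and keep applying changes from the other master
SET GLOBAL read_only = ON;

-- On the active (frontend) master, leave writes enabled
SET GLOBAL read_only = OFF;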

There are nice setups with haproxy and health checks that handle the failover automatically and transparently for your clients.

edlerd
  • Indeed, there is a session cleanup job, but it is already running on the frontend server. At the time the deadlocks occur (mostly midnight) our staff is at home and no one is working in the admin backend, so the only writes to the session table come from the frontend server. – rabudde Oct 29 '16 at 16:35
  • Is there another job doing writes on the backend master? If it happens at midnight, it could be a cronjob, maybe some backup-related task. I have only seen deadlocks with simultaneous writes to the same database on multiple masters. That is also why Percona recommends doing all writes on a single machine at any point in time. – edlerd Oct 31 '16 at 08:01
  • Most deadlocks occur around midnight when a daily backup is running, but we use Percona's `innobackupex`, which should prevent such failures. There are also some deadlocks around midday, which cannot be explained by a running backup. – rabudde Oct 31 '16 at 08:29
  • Oh, sorry, as I could not believe that the Percona backup triggers these problems, I searched again for other tasks. And indeed, there is a daily cleanup task which discards old sessions. As it cleans up sessions that haven't been used for 7 days, it's hard to understand why cleaning up such old sessions leads to a deadlock. To get into this situation, a seven-day-old session would have to be re-used at exactly the same time (second) it is discarded!? That would be a really big coincidence for something that happens 1-2 times a week, wouldn't it? – rabudde Oct 31 '16 at 08:37
  • So you are doing writes on both masters? That would be one possible cause. – edlerd Oct 31 '16 at 10:03
  • Last evening at 11pm a deadlock occurred again, and that is a time when no other job/task is running on the backend server :( I think it will be very difficult to find the real reason. – rabudde Nov 02 '16 at 09:44
  • You could reach out to the people at Percona. I am sure they could tell you other reasons (if there are any I am missing). https://www.percona.com – edlerd Nov 03 '16 at 18:55