
I'm running an InnoDB Cluster on 5.7.25 (planning to migrate to 8.0 shortly). Two of my instances have left the cluster due to network issues, and I'm left with one healthy node.

I'm doing the following procedure to add a node to the cluster, which fails with the errors shown below.

What am I doing wrong?

Note: host1 is the healthy node left in the cluster; host2 is the one joining.

Procedure on host1:

  1. Set super_read_only = ON
  2. Copy last GTIDs using: select @@global.gtid_executed;
  3. Set super_read_only = OFF (right before step 3 on host2)
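The host1 steps above can be sketched from the shell; this is a command fragment, assuming the mysql client can authenticate (e.g. via ~/.my.cnf) and that host1 and the snapshot path are placeholders:

```shell
# Block writes on host1 while its data directory is copied (step 1).
mysql -h host1 -e "SET GLOBAL super_read_only = ON;"

# Record the executed GTID set for use in step 4 on host2 (step 2).
mysql -h host1 -N -e "SELECT @@GLOBAL.gtid_executed;" > /tmp/host1-gtid.txt

# Re-enable writes (step 3) -- run this right before starting mysqld on host2.
mysql -h host1 -e "SET GLOBAL super_read_only = OFF;"
```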

Procedure on host2:

  1. Stop mysql
  2. rsync mysql data dir from host1 using: rsync -Parvz --exclude="auto.cnf" --exclude="<host1>*" --exclude="binlog.*" <user>@<host1>:/mysql-data/* .
  3. Start mysql
  4. Clear replication logs and set GTID's using:
reset master;
reset slave;
set SQL_LOG_BIN=0; 
set @@GLOBAL.GTID_PURGED='<gtid from step 2 on host1>';
set SQL_LOG_BIN=1; 
  5. Connect to MySQL Shell and add the new node (host2) to the cluster: cluster.addInstance('root@host2:3306', {ipWhitelist: 'host1, host2'})
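After the addInstance call, the cluster state can be checked from MySQL Shell; a hedged sketch (the connection URI is a placeholder, and this is a command fragment that needs a live cluster):

```shell
# Inspect cluster membership and per-node status after the join attempt
# (MySQL Shell JavaScript mode; root@host1:3306 is an assumed URI).
mysqlsh --uri root@host1:3306 -e "var c = dba.getCluster(); print(c.status())"
```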

Logs from new instance which fails to join (host2):

2020-03-09T15:19:33.328996Z 38 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind
=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2020-03-09T15:19:33.514003Z 38 [Note] Plugin group_replication reported: 'Group communication SSL configuration: group_replication_ssl_mode: "DISABLED"'
2020-03-09T15:19:33.514154Z 38 [Warning] Plugin group_replication reported: '[GCS] Automatically adding IPv4 localhost address to the whitelist. It is mandatory that it is added.'
2020-03-09T15:19:33.514181Z 38 [Note] Plugin group_replication reported: '[GCS] SSL was not enabled'
2020-03-09T15:19:33.514193Z 38 [Note] Plugin group_replication reported: 'Initialized group communication with configuration: group_replication_group_name: "<uuid1>"; group_replication_local_address: "host2:33061"; group_replication_group_seeds: "host1:33061"; group_replication_bootstrap_group: false; group_replication_poll_spin_loops: 100; group_replication_compression_threshold: 1000; group_replication_ip_whitelist: "host1ip, host2ip"'
2020-03-09T15:19:33.514223Z 38 [Note] Plugin group_replication reported: '[GCS] Configured number of attempts to join: 0'
2020-03-09T15:19:33.514227Z 38 [Note] Plugin group_replication reported: '[GCS] Configured time between attempts to join: 5 seconds'
2020-03-09T15:19:33.514239Z 38 [Note] Plugin group_replication reported: 'Member configuration: member_id: 139923628; member_uuid: "<uuid2>"; single-primary mode: "true"; group_replication_auto_increment_increment: 7; '
2020-03-09T15:19:33.514576Z 40 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2020-03-09T15:19:33.613296Z 43 [Note] Slave SQL thread for channel 'group_replication_applier' initialized, starting replication in log 'FIRST' at position 0, relay log './scynbm96-relay-bin-group_replication_applier.000001' position: 4
2020-03-09T15:19:33.613383Z 38 [Note] Plugin group_replication reported: 'Group Replication applier module successfully initialized!'
2020-03-09T15:19:33.613811Z 0 [Note] Plugin group_replication reported: 'XCom protocol version: 3'
2020-03-09T15:19:33.613858Z 0 [Note] Plugin group_replication reported: 'XCom initialized and ready to accept incoming connections on port 33061'
2020-03-09T15:19:33.667118Z 0 [Warning] Plugin group_replication reported: 'read failed'
2020-03-09T15:19:33.685025Z 0 [ERROR] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 33061'
2020-03-09T15:19:34.732938Z 48 [Note] Got an error reading communication packets
2020-03-09T15:20:04.733653Z 52 [Note] Got an error reading communication packets
2020-03-09T15:20:33.613595Z 38 [ERROR] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2020-03-09T15:20:33.613655Z 38 [Note] Plugin group_replication reported: 'Requesting to leave the group despite of not being a member'
2020-03-09T15:20:33.613697Z 38 [ERROR] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'
2020-03-09T15:20:33.614136Z 43 [Note] Error reading relay log event for channel 'group_replication_applier': slave SQL thread was killed
2020-03-09T15:20:33.614325Z 43 [Note] Slave SQL thread for channel 'group_replication_applier' exiting, replication stopped in log 'FIRST' at position 0
2020-03-09T15:20:33.614966Z 40 [Note] Plugin group_replication reported: 'The group replication applier thread was killed'
2020-03-09T15:20:34.734155Z 55 [Note] Got an error reading communication packets
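The "[GCS] The member was unable to join the group. Local port: 33061" error above is typically a connectivity or whitelist problem on the group-communication port rather than a data problem. A quick reachability sketch (hostnames are placeholders; run on each host, since GCS needs both directions open):

```shell
# Check that the XCom/GCS port is reachable from this host.
nc -z -w 3 host1 33061 && echo "host1:33061 reachable" || echo "host1:33061 unreachable"
nc -z -w 3 host2 33061 && echo "host2:33061 reachable" || echo "host2:33061 unreachable"
```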

1 Answer


The following steps finally allowed me to form a healthy three-node cluster.

  1. Set the healthy node to super_read_only
  2. Wait a little while for existing transactions to complete
  3. Copy the GTIDs using select @@global.gtid_executed;
  4. On host2 and host3, install mysql from scratch
  5. On host2 and host3, stop mysql server
  6. rsync the data to both hosts using: rsync -Parvz --exclude="auto.cnf" --exclude="<host1>*" --exclude="binlog.*" <user>@<host1>:/mysql-data/* .
  7. Verify that the GTIDs have not changed on host1
  8. Start mysql on host2 and host3, verify data is intact by selecting on some tables
  9. Using the mysql shell, dissolve the cluster
  10. Create the cluster again, adding host2 and host3 as members from the start.
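For step 7, the GTID snapshot taken in step 3 can be compared against a fresh read with a small helper. A sketch (the function name is made up, and it does an exact string comparison only; it does not evaluate GTID-set containment):

```shell
# Hypothetical helper: report whether two gtid_executed snapshots are identical.
gtids_unchanged() {
  if [ "$1" = "$2" ]; then
    echo "unchanged"
  else
    echo "CHANGED"
  fi
}

# Example with placeholder GTID sets:
gtids_unchanged "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-100" \
                "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-100"
```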

Note: After the cluster is dissolved, you'll need to restart all MySQL Routers.
Note 2: There's some monitoring info here: https://dev.mysql.com/doc/refman/5.7/en/group-replication-monitoring.html (version 8.x adds further logging and instrumentation).
