Galera Cluster 2 nodes - Unable to restart MySQL server on node 1

Question

i am experiencing this issue, similar to this one: Unable to restart MySQL server but i not sure how to proceed so i am asking to the community especially with someone with more experience than me on Galera Cluster. I'll try to summarize:

Configuration:

Galera Cluster 2 nodes - Every node is an Ubuntu 16.04 and has Mariadb 10.2.17 version.

Issue:

One of the node (node1) is in fault, unfortunately there is no error-log or general-log configured but on journalctl i can see that the error is something like "mariadb innodb assertion failure in file" and it's suggest to try innodb_force_recovery (1 to 6) but i don't know how the Galera sync works, or if it's an active\active configuration so i am not confident to eventually start a node not synchronized since days risking a split brain situation. Also, i see on the datadir that a file called "sst_in_progress" is present.

Consideration:

Will be ok to eventually delete the datadir on the fault node and restart the mysql service? Could it be enough to make it starts to sync with the node2 replicating the data without touching datas on node2 who is currentrly delivering service to the clients? Also as far as i understand Galera cluster doesn't replicate system tables so i should export mysql.user table from node2 and import on node1 to have all the users and permission back. Thanks, i hope i succeed to explain the issue, if it's not clear please tell me.

score 0 · Answer 1 · answered Nov 17 '21 at 11:26

The file sst_in_progress means that the broken node has requested an SST (State Snapshot Transfer), which is basically a full data transfer from the other node in the cluster. There are several different SST methods that you can use, and you can see which one you have enabled by checking the wsrep_sst_method variable. Important to note, is that the donor and joiner node must use the same SST method. For more information about the different SST methods, and SST's in general, I recommend the mariadb documentation

The SST should be able to rejoin the broken node back into the cluster. You can see the progress of the SST in the mysql error logs. But, as you do not have that configured, you could instead check the wsrep status (e.g. show global status like '%wsrep%;') on the nodes. You can see the node status by checking wsrep_local_state_comment. If the healthy node is tranfering an SST to the broken node, you will see that the value of wsrep_local_state_comment is Donor/Desynced. More detailed information about the various wsrep variables can be found in the Galera documentation

If the automatic SST has problems, you can instead do a manual SST. For Mariadb 10.1 or later, it is recommended to use Mariabackup for this. You can find information on doing a manual SST with Mariabackup in the Mariadb documentation.

In answer to your question about deleting the datadir on the broken node, and restarting the MySQL service: This would force the broken node to request an SST from the other node. Please note that starting the MySQL service may timeout, as depending on the size of the datadir, this process can take a long time.

The SST will also transfer the system tables to the broken node, so after the SST is complete, the mysql.user table should be complete with the users and permissions, and you should not need to recreate them.

As a side observation, I see that you are using a 2 node Galera cluster. In general, it is recommended to use at least 3 nodes. If you are operating a 2 node Galera cluster, you may want to use a Galera arbitrator. More information about that can be found in the Galera documentation

Thanks, i figured it out by myself because it was very urgent. All the steps you provided are the same i did and it fixed my corrupted node. Thanks anyway. — mescalero0, Nov 22 '21 at 08:57

Galera Cluster 2 nodes - Unable to restart MySQL server on node 1

1 Answers1