I scaled in a TiDB cluster a few weeks ago to remove a misbehaving TiKV peer.
The peer refused to tombstone even after a full week, so I shut the server down, waited a few days to see whether any issues came up, and then ran a forced scale-in to remove it from the cluster.
Even though tiup cluster display {clustername} no longer shows that server, some of the other TiKV servers keep trying to contact it.
Example log entries:
[2022/10/13 14:14:58.834 +00:00] [ERROR] [raft_client.rs:840] ["connection abort"] [addr=1.2.3.4:20160] [store_id=16025]
[2022/10/13 14:15:01.843 +00:00] [ERROR] [raft_client.rs:567] ["connection aborted"] [addr=1.2.3.4:20160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error=Some(RemoteStopped)] [store_id=16025]
(IP replaced with 1.2.3.4, but the rest is verbatim)
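In case it helps anyone check their own logs, here is a minimal sketch of how entries like the ones above could be tallied per store. It assumes the bracketed key=value layout shown in my two samples; the function names parse_fields and count_unreachable are hypothetical helpers, not part of any TiKV tooling.

```python
import re
from collections import Counter

# Matches simple bracketed fields such as [addr=1.2.3.4:20160] or [store_id=16025].
# Fields whose values contain spaces (e.g. receiver_err) are deliberately skipped.
FIELD_RE = re.compile(r"\[(\w+)=([^\]\s]+)\]")


def parse_fields(line):
    """Return the simple key=value fields of one log line as a dict."""
    return dict(FIELD_RE.findall(line))


def count_unreachable(lines):
    """Count raft_client ERROR entries per (store_id, addr) pair."""
    counts = Counter()
    for line in lines:
        # Only look at error entries emitted by raft_client.rs (assumed filter).
        if "[ERROR]" not in line or "raft_client.rs" not in line:
            continue
        fields = parse_fields(line)
        if "store_id" in fields and "addr" in fields:
            counts[(fields["store_id"], fields["addr"])] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python tally.py < tikv.log
    for (store_id, addr), n in count_unreachable(sys.stdin).most_common():
        print(f"store_id={store_id} addr={addr} errors={n}")
```

Running this over each TiKV node's log would show whether every node is still trying to reach the same store_id and address, or only some of them.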
The server in question was removed from the cluster about a month ago, and yet the TiKV nodes still think it's there.
How do I correct this?
The store_id might be a clue. I believe there is a Raft store where the removed server was a leader, but how do I force that store to choose a new leader? The documentation is not clear on this, but I believe the solution has something to do with the PD servers.