
I scaled in a TiDB cluster a few weeks ago to remove a misbehaving TiKV peer.

The peer refused to tombstone even after a full week, so I turned the server itself off, left it for a few days to see if there were any issues, and then ran a forced scale-in to remove it from the cluster.
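For context, a forced scale-in of a dead node looks roughly like the following. This is a hedged sketch: `mycluster` stands in for the real cluster name and `1.2.3.4:20160` for the dead TiKV's address, and `--force` is destructive, so it should only be used on a node that is already permanently gone.

```shell
# Forcibly remove the dead TiKV node from the topology.
# --force skips the normal region-migration wait, which is why
# it can leave stale store metadata behind in PD.
tiup cluster scale-in mycluster --node 1.2.3.4:20160 --force

# Confirm the node no longer appears in the topology
tiup cluster display mycluster
```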

Even though `tiup cluster display {clustername}` no longer shows that server, some of the other TiKV servers keep trying to contact it.

Example log entries:

[2022/10/13 14:14:58.834 +00:00] [ERROR] [raft_client.rs:840] ["connection abort"] [addr=1.2.3.4:20160] [store_id=16025]
[2022/10/13 14:15:01.843 +00:00] [ERROR] [raft_client.rs:567] ["connection aborted"] [addr=1.2.3.4:20160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error=Some(RemoteStopped)] [store_id=16025]

(IP replaced with 1.2.3.4, but the rest is verbatim)

The server in question was removed from the cluster about a month ago, and yet the TiKV nodes still think it's there.

How do I correct this?

The `store_id` might be a clue. I believe there is a Raft store where the removed server was the leader, but how do I force that store to choose a new leader? The documentation is not clear on this, but I believe the solution has something to do with the PD servers.

Kae Verens

1 Answer

  1. Could you first check the store ID in pd-ctl to confirm it is in the Tombstone state? For pd-ctl usage, please refer to https://docs.pingcap.com/tidb/dev/pd-control. You can use pd-ctl to delete a store; once it reaches Tombstone, use pd-ctl's `store remove-tombstone` to remove it completely.

  2. For all Regions in TiKV, if a Region's leader is disconnected, the followers re-elect a new leader, so the dead TiKV node won't remain the leader of any Region.
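The sequence described above might look like the following sketch, assuming pd-ctl is available on the machine and PD is reachable at `127.0.0.1:2379` (substitute your own PD endpoint). Note that in some versions `store remove-tombstone` purges all tombstone stores rather than a specific one.

```shell
# Inspect the store; the "state_name" field shows Up / Offline / Tombstone
pd-ctl -u http://127.0.0.1:2379 store 16025

# If it is still Up or Offline, ask PD to delete it. This marks the
# store Offline and it becomes Tombstone once its Regions migrate away.
pd-ctl -u http://127.0.0.1:2379 store delete 16025

# List any Regions that still have a peer on the store; an empty result
# means no Region (and no leader) still references it
pd-ctl -u http://127.0.0.1:2379 region store 16025

# Once the store shows Tombstone, purge it from PD's metadata entirely
pd-ctl -u http://127.0.0.1:2379 store remove-tombstone
```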

Qi Xu
  • Thank you. When I run `store 16025`, it says the status is "Offline", but it also says that `used_size` is 0B, so does that mean it's safe to remove? I don't want to assume anything. – Kae Verens Oct 14 '22 at 16:03
  • I tried `store delete 16025` and PD says `Success!`, but when I run `store 16025` the same information shows again. The store is still "Offline", with `node_state` 2. – Kae Verens Oct 14 '22 at 16:06
  • I ran `unsafe remove-failed-stores 16025`. After about 30 seconds, the store's state changed to `tombstone`, but when I ran `store remove-tombstone 16025`, PD said `Failed to remove tombstone store [500] "failed stores: 16025"`. There are 0 results in Google for "Failed to remove tombstone store". – Kae Verens Oct 15 '22 at 17:34