We have a Cassandra 2.0.17 cluster with 3 DCs, where each DC has 8 nodes and RF of 3. We have not been running regular repairs on it.
One node has been down for 2 months due to hardware issue with one of the drives. We finally got a new drive to replace the faulty one, and are trying to figure out the best way to bring the node back into the cluster.
We initially thought to just run nodetool repair
but from my research so far it seems like that would only be good if the node was down for less than gc_grace_seconds which is 10 days.
Seems like that would mean removing the node and then adding it back in as a new node.
Someone mentioned somewhere that rather than completely removing the node and then bootstrapping it back in, I could potentially use the same procedure used for replacing a node, using the replace_address
flag (or replace_address_first_boot
if available), to replace the node with itself. But I couldn't find any real documentation or case studies of doing this.
It seems like this is not a typical situation - normally, either a node goes down for a short period of time and you can just run repair on it, or it needs to be replaced altogether. But it's hard to find much prior art on our exact use case.
What would be the best options for bringing this node back into service in a safe way, ideally with the least amount of impact to the rest of the cluster? Is repair really not a good option here?
Also, whatever the answer is, how would I monitor the process and ensure that it's successful?