
We have a Cassandra 2.0.17 cluster with 3 DCs; each DC has 8 nodes and an RF of 3. We have not been running regular repairs on it.

One node has been down for 2 months due to a hardware issue with one of the drives. We finally got a new drive to replace the faulty one, and are trying to figure out the best way to bring the node back into the cluster.

We initially thought to just run nodetool repair, but from my research so far it seems that would only be appropriate if the node had been down for less than gc_grace_seconds, which is 10 days.
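For reference, the effective gc_grace_seconds can be checked per table. A minimal sketch, assuming a 2.0-era cluster where table metadata lives in system.schema_columnfamilies ('my_keyspace' is a placeholder):

```shell
# Hypothetical check: list gc_grace_seconds for every table in one keyspace.
# Piping into cqlsh avoids relying on newer cqlsh command-line flags.
echo "SELECT columnfamily_name, gc_grace_seconds
      FROM system.schema_columnfamilies
      WHERE keyspace_name = 'my_keyspace';" | cqlsh
```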

It seems like that would mean removing the node and then adding it back in as a new node. Someone mentioned somewhere that rather than completely removing the node and then bootstrapping it back in, I could potentially use the same procedure used for replacing a node, using the replace_address flag (or replace_address_first_boot if available), to replace the node with itself. But I couldn't find any real documentation or case studies of doing this.

It seems like this is not a typical situation - normally, either a node goes down for a short period of time and you can just run repair on it, or it needs to be replaced altogether. But it's hard to find much prior art on our exact use case.

What would be the best options for bringing this node back into service in a safe way, ideally with the least amount of impact to the rest of the cluster? Is repair really not a good option here?

Also, whatever the answer is, how would I monitor the process and ensure that it's successful?

Muhammad Dyas Yaskur
amitzko

2 Answers


So here's what I would do:

  • If you haven't already, run a removenode on the "dead" node's host ID.
  • Fire up the old node, making sure that it is not a seed node and that auto_bootstrap is either true or not specified (it defaults to true unless explicitly set otherwise).
  • It should join right back in and re-stream its data.
  • You can monitor its progress by running nodetool netstats | grep Already, which returns a status for each node streaming, showing completion progress as the number of files streamed vs. total files.
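As a sketch, here is what that grep looks like against example output (the sample line is illustrative of the "Already received" format from nodetool netstats; exact wording can vary by version, and in a real run you would pipe nodetool netstats itself into the filter):

```shell
# Illustrative sample of one streaming line from `nodetool netstats`.
sample='    Receiving 3 files, 1093922 bytes total. Already received 2 files, 729281 bytes total'

# Pull the "received vs. total" file counts out of each "Already" line.
printf '%s\n' "$sample" | grep Already | \
  awk '{print "received " $9 " of " $2 " files"}'
# -> received 2 of 3 files
```

When every streaming session reports received equal to total, the bootstrap streams are done.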

The advantage of doing it this way is that the node will not attempt to serve requests until bootstrapping is complete.
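A rough command sketch of the steps above (the host ID and data paths are placeholders; verify both against your own cluster and cassandra.yaml before running anything):

```shell
# 1. From a live node, find the dead node's host ID (it shows as "DN" here).
nodetool status

# 2. Remove the dead node from the ring by its host ID (placeholder value).
nodetool removenode 1a2b3c4d-aaaa-bbbb-cccc-0123456789ab

# 3. On the repaired node, wipe old state so it bootstraps fresh
#    (adjust paths to match data_file_directories etc. in cassandra.yaml).
rm -rf /var/lib/cassandra/data/* \
       /var/lib/cassandra/commitlog/* \
       /var/lib/cassandra/saved_caches/*

# 4. Ensure cassandra.yaml does NOT list this node in its own seed list and
#    that auto_bootstrap is absent or true, then start the service.
sudo service cassandra start
```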

If you run into trouble, feel free to comment here or ask for help in the cassandra-admins channel on DataStax's Discord server.

Aaron
  • As a 2nd step, I would also clear the data, hints, commitlog, and saved_caches directories, keeping just the installation files, to be sure – Madhavan Dec 02 '21 at 23:18

You have already mentioned that you are aware the node has to be removed if it is down for more than gc_grace_seconds.

What would be the best options for bringing this node back into service in a safe way, ideally with the least amount of impact to the rest of the cluster? Is repair really not a good option here?

So that is the answer: you cannot safely bring that node back if it has been down for more than gc_grace_seconds. It needs to be removed to prevent possibly deleted data from reappearing.

https://stackoverflow.com/a/69098765/429476


From https://community.datastax.com/questions/3987/one-of-my-nodes-powered-off.html (answered by Erick Ramirez, May 12 2020; edited Dec 03 2021; accepted answer):

@cache_drive If the node has been down for less than the smallest gc_grace_seconds, it should be as simple as starting Cassandra on the node then running a repair on it.

If the node has been down longer than the smallest GC grace, you will need to wipe the node clean including deleting all the contents of data/, commitlog/ and saved_caches/. Then replace the node "with itself" by adding the replace_address flag and specifying its own IP. For details, see Replacing a dead node. Cheers!
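A sketch of that "replace with itself" approach (the IP is a placeholder; on 2.0-era installs the flag is typically passed as a JVM option, e.g. in conf/cassandra-env.sh):

```shell
# After wiping data/, commitlog/ and saved_caches/, pass the replace flag
# as a JVM option, using the node's OWN IP address (placeholder shown):
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.5"

# Then start Cassandra. Once the node finishes bootstrapping and shows as
# UN in `nodetool status`, remove the flag so later restarts are normal.
```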

Alex Punnen