
I am restoring a backup to a fresh Cassandra 2.2.5 cluster consisting of 3 nodes.

Initial cluster health of the NEW cluster:

--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.40.1.1   259.31 KB   256          ?       d2b29b08-9eac-4733-9798-019275d66cfc  uswest1adevc
UN  10.40.1.2   230.12 KB   256          ?       5484ab11-32b1-4d01-a5fe-c996a63108f1  uswest1adevc
UN  10.40.1.3   248.47 KB   256          ?       bad95fe2-70c5-4a2f-b517-d7fd7a32bc45  uswest1cdevc

As part of the restore instructions in the DataStax docs, I do the following on the new cluster (a rough sketch of the commands is below the list):

1) cassandra stop on all three nodes, one by one.

2) Edit cassandra.yaml on all three nodes with the backed-up token ring information. [Step 2 from docs]

3) Remove the contents of /var/lib/cassandra/data/system/* [Step 4 from docs]

4) cassandra start on nodes 10.40.1.1, 10.40.1.2, 10.40.1.3 respectively.
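
For reference, what I ran on each node looks roughly like this (the service name and paths are my package defaults, and the initial_token line is only illustrative):

$ sudo service cassandra stop

# step 2: put this node's backed-up tokens into cassandra.yaml, e.g.
#   initial_token: <comma-separated list of the 256 tokens saved from the old node>

$ sudo rm -rf /var/lib/cassandra/data/system/*

$ sudo service cassandra start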

Result: 10.40.1.1 restarts successfully:

--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.40.1.1   259.31 KB   256          ?       2d23add3-9eac-4733-9798-019275d125d3  uswest1adevc

But the second and third nodes fail to start, logging:

java.lang.RuntimeException: A node with address 10.40.1.2 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:546) ~[apache-cassandra-2.2.5.jar:2.2.5]
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:766) ~[apache-cassandra-2.2.5.jar:2.2.5]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:693) ~[apache-cassandra-2.2.5.jar:2.2.5]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:585) ~[apache-cassandra-2.2.5.jar:2.2.5]
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:300) [apache-cassandra-2.2.5.jar:2.2.5]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:516) [apache-cassandra-2.2.5.jar:2.2.5]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625) [apache-cassandra-2.2.5.jar:2.2.5]
INFO  [StorageServiceShutdownHook] 2016-08-09 18:13:21,980 Gossiper.java:1449 - Announcing shutdown

java.lang.RuntimeException: A node with address 10.40.1.3 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
...

Eventual cluster health:

--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.40.1.1   259.31 KB   256          ?       2d23add3-9eac-4733-9798-019275d125d3  uswest1adevc
DN  10.40.1.2   230.12 KB   256          ?       6w2321ad-32b1-4d01-a5fe-c996a63108f1  uswest1adevc
DN  10.40.1.3   248.47 KB   256          ?       9et4944d-70c5-4a2f-b517-d7fd7a32bc45  uswest1cdevc

I understand that the Host ID of a node might change after the system directories are removed.

My question is:

Do I need to explicitly tell each node at startup to replace itself? Are the docs incomplete, or am I missing something in my steps?
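
In other words, is the error asking for something like the following on the failing nodes (just my guess at how the replace_address flag would be passed, either on the command line or via cassandra-env.sh)?

$ cassandra -Dcassandra.replace_address=10.40.1.2

# or, equivalently, in cassandra-env.sh:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.40.1.2"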

prat0318

3 Answers


It turns out there were stale commitlog and saved_caches directories that I had missed deleting earlier. The instructions work correctly once those directories are deleted as well.
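
Concretely, in addition to clearing the system tables, something like this on each node (assuming the default package paths):

$ sudo rm -rf /var/lib/cassandra/commitlog/*
$ sudo rm -rf /var/lib/cassandra/saved_caches/*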

prat0318

Usually in a situation like this, after I do a

$ systemctl stop cassandra

I will run

$ ps awxs | grep cassandra

and notice Cassandra still has some processes up.

I usually do a

$ kill -9 <cassandra pid>

and

$ rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/*
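
If pgrep/pkill are available, the lookup-and-kill can be done in one step (my shorthand, not part of the original answer):

$ pkill -9 -f CassandraDaemon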

Mr'Black

java.lang.RuntimeException: A node with address 10.40.1.3 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.

If you are still facing the above error, it means the Cassandra process is still running on that node. Log in to the 10.40.1.3 node first, then follow these steps:

$ jps

You see some processes running. For example:

9107 Jps
1112 CassandraDaemon

Then kill the CassandraDaemon process using the process ID you see after executing jps. In my example, the process ID for CassandraDaemon is 1112.

$ kill -9 1112

Then check the processes again after a while:

$ jps

You will see that CassandraDaemon is no longer running.

9170 Jps

Then remove your saved_caches and commitlog directories and start Cassandra again. Do this on every node that is hitting the error mentioned above.
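
That last step would look roughly like this, assuming the default data paths and a service-based install:

$ sudo rm -rf /var/lib/cassandra/saved_caches/*
$ sudo rm -rf /var/lib/cassandra/commitlog/*
$ sudo service cassandra start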

Anower Perves