
I have a 2-node Apache Cassandra (2.0.3) cluster with a replication factor of 1. I changed the replication factor to 2 using the following command in cqlsh:

ALTER KEYSPACE "mykeyspace" WITH REPLICATION =   { 'class' : 'SimpleStrategy', 'replication_factor' : 2 };

I then tried to run 'nodetool repair', which is recommended after this kind of ALTER.

The problem is that this command sometimes finishes very quickly. When it does, it normally prints 'Lost notification...' and exits with a non-zero exit code.

So I just repeat 'nodetool repair' until it finishes without error. I also check that 'nodetool status' reports the expected disk usage for each node. (With a replication factor of 1, each node holds about 7GB, so after 'nodetool repair' I expect about 14GB per node, assuming no cluster usage in the meantime.)
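As a rough sketch of that retry loop: the function name repair_with_retry, the attempt cap, and the RETRY_DELAY variable are illustrative choices of mine, not part of nodetool.

```shell
# Hypothetical helper: re-run `nodetool repair` until it exits with status 0,
# giving up after a fixed number of attempts.
repair_with_retry() {
  keyspace=$1
  max_attempts=${2:-5}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if nodetool repair "$keyspace"; then
      echo "repair succeeded on attempt $attempt"
      return 0
    fi
    echo "repair attempt $attempt failed" >&2
    attempt=$((attempt + 1))
    # Pause before the next try; RETRY_DELAY (seconds) is an assumed knob.
    sleep "${RETRY_DELAY:-60}"
  done
  return 1
}
```

For example, `repair_with_retry mykeyspace 10` would try up to ten times and return non-zero if every attempt failed.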

Is there a more correct way to determine that 'nodetool repair' is finished in this case?

Aaron
user3865568

4 Answers


Generally speaking, you can monitor a nodetool repair operation with two nodetool commands:

  • compactionstats
  • netstats

The repair operation has two distinct phases. First it calculates the differences between the nodes (repair work to be done), and then it acts on those differences by streaming data to the appropriate nodes.

This checks on the active Merkle Tree calculations:

$ nodetool compactionstats
pending tasks: 0
Active compaction remaining time :        n/a
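If you want to script that check, something like the sketch below may do. The helper name validation_in_progress is made up for illustration, and it assumes that validation compactions appear in the compactionstats output as rows containing the word "Validation", which may vary between Cassandra versions.

```shell
# Hypothetical helper: succeeds (exit 0) while compactionstats lists at least
# one validation compaction, i.e. a Merkle tree is still being built.
# Assumes such rows contain the word "Validation" (matched case-insensitively).
validation_in_progress() {
  nodetool -h "${1:-localhost}" compactionstats | grep -qi 'validation'
}
```

Usage would be along the lines of `validation_in_progress && echo "still validating"`. An empty result does not prove the repair is over, since the streaming phase comes after validation.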

The repair streams can be monitored by:

$ nodetool netstats

In fact, The Last Pickle's Aaron Morton suggests using the following Bash command to monitor any active repair streams:

while true; do date; diff <(nodetool -h localhost netstats) <(sleep 5 && nodetool -h localhost netstats); done
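If you would rather block until streaming has stopped than watch diffs, a polling loop is one option. This is only a sketch: wait_for_streams is a hypothetical name, and it assumes that an idle node's netstats output contains the line "Not sending any streams", which is worth confirming on your Cassandra version.

```shell
# Hypothetical helper: poll `nodetool netstats` until it reports no streams.
# Assumes an idle node prints "Not sending any streams" (check your version).
# Note: if nodetool itself fails (e.g. cannot connect), this loops forever.
wait_for_streams() {
  host=${1:-localhost}
  interval=${2:-5}
  until nodetool -h "$host" netstats | grep -q 'Not sending any streams'; do
    date
    sleep "$interval"
  done
  echo "no active streams on $host"
}
```

For example, `wait_for_streams localhost 10` polls every ten seconds. Streaming only starts once the Merkle tree calculations are done, so an idle netstats on its own is not proof that the repair has finished; check nodetool compactionstats as well.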

DataStax has a posting in their support forums about troubleshooting hanging repairs. If you have any hung repair streams, you should be able to see them with nodetool netstats. This can happen if one of your nodes becomes unavailable during the repair process. To monitor the specific repair operations, you can check your log file for entries like this:

DEBUG [WRITE-/172.30.77.197] 2013-05-03 12:43:09,107 OutboundTcpConnection.java (line 165) error writing to /172.30.77.197 java.net.SocketException: Connection reset

Note that repair sessions should also be denoted in your system.log:

[repair #02fc68f0-210c-11e7-aa88-c35a9a02c19a] Starting...

[repair #02fc68f0-210c-11e7-aa88-c35a9a02c19a] Completed...
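To see at a glance where each session stands, you could pull the most recent log line for every repair session ID. The sketch below assumes only the "[repair #<id>]" prefix shown above; the function name repair_sessions and the default log path /var/log/cassandra/system.log are assumptions, so pass your own path if it differs.

```shell
# Hypothetical helper: print the most recent log line of each repair session,
# in order of first appearance. Relies only on the "[repair #<id>]" prefix.
repair_sessions() {
  log=${1:-/var/log/cassandra/system.log}   # assumed default location
  grep -o '\[repair #[0-9a-f-]*\].*' "$log" |
    awk '{
           id = $2                          # "#<session-id>]"
           last[id] = $0                    # remember the latest line per id
           if (!(id in seen)) { seen[id] = 1; order[++n] = id }
         }
         END { for (i = 1; i <= n; i++) print last[order[i]] }'
}
```

A session whose latest line is still its start message has not reported completion yet.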
Aaron
  • This is a good answer; here is the source thread by Aaron Morton: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Is-it-safe-to-stop-a-read-repair-and-any-suggestion-on-speeding-up-repairs-td6607367.html – APZ Aug 01 '14 at 17:06
  • @Aaron Okay, what if `nodetool netstats` tells you that everything is done and `nodetool repair` does not return? Would it then be safe to use Ctrl-C on that run? On my end I was just testing and reset my database, but after doing that (Ctrl-C) and then trying to run `nodetool repair` again, it just hung again... – Alexis Wilke Jul 01 '16 at 01:27
  • @AlexisWilke It's always safe to Ctrl-C out of a repair command. In fact, the only way to stop a repair is with a `nodetool stop validation`. There are lots of things that can lead to a hung repair. Monitor the number of pending repairs via JMX, and if that number never reaches zero you may need to bounce the node. Network instability can lead to hung repairs as well. – Aaron Jul 01 '16 at 13:19
  • @Aaron Sorry for the ignorance but, how could I check the number of pending repairs via JMX? I have tried to use Jconsole connecting remotely from my computer to one of the Cassandra nodes in AWS using [this](https://malalanayake.wordpress.com/2013/03/07/jconsole-with-cassandra-db/) but I cannot connect. – Janbalik Jul 22 '16 at 07:43
  • @Aaron `nodetool: compaction_type: can not convert "validation" to a OperationType` - I get that when trying to run `nodetool stop validation` on Cassandra 3.0 (DSE 5.0.3) – 2rs2ts Nov 16 '16 at 21:28
  • What JMX MBean shows the number of pending repairs? Also, I'm not sure that `nodetool compactionstats` is a reliable way to see if repairs are done: sometimes it lists some Validation compactions, but sometimes it doesn't. Are read repairs also included in that output? – Shannon Jan 04 '17 at 22:02
  • `nodetool stop -- VALIDATION` – Leo Romanovsky Dec 08 '17 at 01:51
  • Using process substitution (`<(...)`) with `sleep` to avoid writing to a temp/prev/last file is brilliant. I'm familiar with all the components at play here, but would have never thought to use them that way. – Bruno Bronosky Jun 25 '18 at 16:19
  • I am facing this issue - https://stackoverflow.com/q/62886719/1060044 Any hints to fix it? Thanks – Krishna Shetty Jul 18 '20 at 02:09

The repair streams can be monitored by passing the --trace option when you start the repair command:

nodetool repair --trace <key_space> <table>

exic
tjeubaoit

We can also monitor the progress of a repair in the OpsCenter console, under Activities.


If anyone is still wondering how to monitor the status of nodetool repair with recent versions of Cassandra: starting with Cassandra 4.0, there is a new nodetool repair_admin command to track and interrupt repair operations.

nodetool repair_admin list

It is backed by the new system table system.repairs, which contains the history of repair operations.

vinsce