
Context

We have 6 Cassandra instances hosted on AWS, spread across 3 different regions, 2 per region (2 in eu-west, 2 in us-west, 2 in ap-southeast).

2 days ago, we moved 2 of our EC2 Cassandra instances from us-west-1 to us-east-1. By "moved" I mean that we decommissioned them and added 2 new instances to our cluster.

We ran nodetool repair, which didn't do anything, and then nodetool rebuild, which synchronized our data from the eu-west data centre. Following that change, we noticed that multiple instances in our Cassandra cluster were using over 70% CPU and had incoming traffic.
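
For reference, this is roughly the sequence we ran on each of the two new us-east-1 nodes (a sketch rather than the exact invocation; the data centre name is the one reported by nodetool status below):

    nodetool repair                            # didn't appear to do anything
    nodetool rebuild cassandra-eu-west-1-A     # streamed our data over from the eu-west data centre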

At first we thought it was replication taking place, but considering that we only have 500MB of data, and that this activity is still running, we are puzzled as to what is happening.


Instance hardware:

All of our instances are m3.medium, which means each has:

  • 1 CPU, 2.5 GHz
  • 3.75 GB of RAM
  • 4GB SSD

Also, we have mounted an EBS volume at /var/lib/cassandra, which is actually a RAID0 of 6 SSD volumes on EBS (assembled roughly as in the sketch below):

  • 300GB of EBS SSD storage in total, striped as RAID0
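
For completeness, the array is assembled more or less like this (device names and per-volume sizes here are illustrative, not our exact setup):

    # stripe 6 EBS SSD volumes into a single 300GB RAID0 array for the Cassandra data directory
    mdadm --create /dev/md0 --level=0 --raid-devices=6 \
          /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi /dev/xvdj /dev/xvdk
    mkfs.ext4 /dev/md0
    mount /dev/md0 /var/lib/cassandra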

Ref: Amazon Instance Types


Software Version:

Cassandra Version: 2.0.12


Thoughts:

After analysing our data we thought this was caused by Cassandra data compaction.

There is another stackoverflow question around the same subject: Cassandra compaction tasks stuck.

However, that issue was solved by moving to a single SSD (Azure Premium Storage, still in preview) with no RAID0 configured for Cassandra, and, as the author noted, there is no obvious reason why that should fix the underlying problem (why would removing RAID0 from the equation fix this?).

We are not yet keen to move to local storage, as the AWS pricing for that is a lot higher than what we have now. That said, if it really is the cause of our problem, we will try it.

Another reason why this sounds like a deeper problem is that we have data showing that these EBS volumes have been writing/reading a lot of data in the last 3 days.

Since we moved instances, we have seen around 300-400KB of data written per second on each EBS volume; since we have a RAID0 of 6 volumes, that is 6 times this amount per instance, i.e. 1.8-2.4MB/s. That amounts to ~450GB of data written PER instance over the last 3 days, and we see basically the same figures for reads.
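
As a rough sanity check on that figure (assuming the write rate is sustained over the full 3 days):

    # GB written per instance over 3 days at 1.8 and 2.4 MB/s respectively
    awk 'BEGIN { print 1.8 * 86400 * 3 / 1024 }'   # ~455
    awk 'BEGIN { print 2.4 * 86400 * 3 / 1024 }'   # ~607

So the ~450GB above corresponds to the lower end of that range.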

We are only running tests against the cluster at the moment, so the only traffic we are getting comes from our CI server and possibly from the Gossip communication between the instances.


Debug notes

Output of nodetool status:

Datacenter: cassandra-eu-west-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns   Host ID                               Rack
UN  xxx.xxx.xxx.xxx 539.5 MB   256     17.3%  12341234-1234-1234-1234-123412340cd7  eu-west-1c
UN  xxx.xxx.xxx.xxx 539.8 MB   256     14.4%  30ff8d00-1ab6-4538-9c67-a49e9ad34672  eu-west-1b
Datacenter: cassandra-ap-southeast-1-A
======================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns   Host ID                               Rack
UN  xxx.xxx.xxx.xxx 585.13 MB  256     16.9%  a0c45f3f-8479-4046-b3c0-b2dd19f07b87  ap-southeast-1a
UN  xxx.xxx.xxx.xxx 588.66 MB  256     17.8%  b91c5863-e1e1-4cb6-b9c1-0f24a33b4baf  ap-southeast-1b
Datacenter: cassandra-us-east-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns   Host ID                               Rack
UN  xxx.xxx.xxx.xxx 545.56 MB  256     15.2%  ab049390-f5a1-49a9-bb58-b8402b0d99af  us-east-1d
UN  xxx.xxx.xxx.xxx 545.53 MB  256     18.3%  39c698ea-2793-4aa0-a28d-c286969febc4  us-east-1e

Output of nodetool compactionstats:

pending tasks: 64
          compaction type        keyspace           table       completed           total      unit  progress
               Compaction         staging    stats_hourly       418858165      1295820033     bytes    32.32%
Active compaction remaining time :   0h00m52s
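
For anyone wanting to reproduce these numbers, a simple way to watch the backlog is something like the following (a sketch; not necessarily how the graph below was produced):

    watch -n 10 nodetool compactionstats    # pending task count and per-task progress
    nodetool compactionhistory              # recently completed compactions per keyspace/table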

Running dstat on an unhealthy instance:

[image: dstat output on the unhealthy instance]

Compaction history in graph form (averaging around 300 compactions per hour, starting on the 16th):

[image: compaction history graph]

EBS volume usage:

[image: EBS Volume 1]

[image: EBS Volume 2]

Running df -h:

Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       33G   11G   21G  34% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            1.9G   12K  1.9G   1% /dev
tmpfs           377M  424K  377M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            1.9G  4.0K  1.9G   1% /run/shm
none            100M     0  100M   0% /run/user
/dev/xvdb       3.9G  8.1M  3.7G   1% /mnt
/dev/md0        300G  2.5G  298G   1% /var/lib/cassandra

Running nodetool tpstats:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0        3191689         0                 0
ReadStage                         0         0         574633         0                 0
RequestResponseStage              0         0        2698972         0                 0
ReadRepairStage                   0         0           2721         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
MiscStage                         0         0          62601         0                 0
HintedHandoff                     0         1            443         0                 0
FlushWriter                       0         0          88811         0                 0
MemoryMeter                       0         0           1472         0                 0
GossipStage                       0         0         979483         0                 0
CacheCleanupExecutor              0         0              0         0                 0
InternalResponseStage             0         0             25         0                 0
CompactionExecutor                1        39          99881         0                 0
ValidationExecutor                0         0          62599         0                 0
MigrationStage                    0         0             40         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropyStage                  0         0         149095         0                 0
PendingRangeCalculator            0         0             23         0                 0
MemtablePostFlusher               0         0         173847         0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
MUTATION                     0
COUNTER_MUTATION             0
BINARY                       0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

Running iptraf, sorted by bytes:

[image: iptraf output, sorted by bytes]

  • Can you add nodetool tpstats and netstats output? – Stefan Podkowinski Mar 19 '15 at 12:21
  • Done. I added iptraf instead of netstats. The image looks a bit small but you can open its URL in a tab to see its full size. Hope that helps. – Vincent B. Mar 19 '15 at 12:46
  • Ah I also added Amazon EBS stats... – Vincent B. Mar 19 '15 at 12:47
  • Are you using leveled compaction strategy? – Stefan Podkowinski Mar 19 '15 at 20:57
  • We are using the default compaction strategy (Size-tiered compaction). I read a bit on leveled compaction, and it would have been a good guess, but we are not using it :(. – Vincent B. Mar 20 '15 at 10:23
  • Ok, next question then :) Does your cluster see a lot of deletes (what's the update/delete rate)? What gc_grace period do you use, and would you expect Cassandra to actually clean up a lot of data based on gc_grace on a regular basis? – Stefan Podkowinski Mar 20 '15 at 10:35
  • Stefan, at present we have an INCREDIBLY low number of updates and deletes; we are not actually in production, so the only reads & writes are our uptime tests and some traffic from our CI test suites. We're probably talking no more than a few hundred updates/deletes a minute at most across the entire cluster of 6 servers. Also, we tried restarting one of the servers to see what effect it had, and now, as a result, a server in another data centre has spiked to 100% CPU immediately following the restart. This is worrying and confusing. – Matthew O'Riordan Mar 20 '15 at 13:23
  • *Correction*, the 100% CPU spike described above was unrelated, apologies for that. – Matthew O'Riordan Mar 20 '15 at 15:27
  • @StefanPodkowinski we had an unexpected issue with the 2 new servers and we had to terminate them. The cluster became healthy again... I did spin up new servers in the same region with the same configuration and it looks fine now after a rebuild. I have a hunch that it's something to do with the `nodetool rebuild` command. But I can't pinpoint where this all started... – Vincent B. Mar 20 '15 at 16:36

2 Answers


We tried a few things from other answers and comments, but what finally solved this issue was terminating the 2 new instances.

When we then added new instances to our cluster, the process went smoothly and the load is now back to normal.

My hunch is that nodetool rebuild or nodetool repair may have triggered unexpected processing on our two nodes. It is also possible that these particular instances were faulty, but I have not found any evidence of that.
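
For anyone running into something similar, the recycling itself was nothing special; roughly the following (assuming the terminated nodes are removed from the ring first, which is the part to double-check in your own setup):

    # on a healthy node, after terminating the two unhealthy us-east-1 instances
    nodetool removenode <host-id-of-dead-node>   # run once per terminated node
    # then bootstrap the two replacement instances and, on each of them:
    nodetool rebuild cassandra-eu-west-1-A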

Here's the CPU usage on our eu-west instances after recycling the us-east instances:

[image: CPU usage on the eu-west instances]

– Vincent B.

m3.medium is on the low end for running C* with any real load (450GB is real load), and EBS is far from optimal. You could try using the ephemeral drive for data to get rid of some of the latency on read/write requests, but you're probably going to simply need more CPU for this kind of thing.

Check out: http://www.datastax.com/documentation/cassandra/2.1/cassandra/planning/architecturePlanningEC2_c.html
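
If you do stay on this hardware, a couple of conservative knobs to experiment with while you dig into the root cause (the numbers below are only illustrative, not tested recommendations):

    nodetool setcompactionthroughput 8    # throttle compaction I/O (MB/s) so it competes less with requests
    nodetool setstreamthroughput 50       # cap streaming throughput (Mbits/s) during rebuild/repair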

– Chris Lohfink
  • We had this infrastructure running for quite some time now with no problem at all. "Your hardware is not good enough" doesn't sound right. Also, what you mention as 450GB is the amount of read/write done, not the traffic/load on the system. As I said, we have nearly no queries running at the moment, so it sounds crazy to think that 450GB has been written/read. – Vincent B. Mar 19 '15 at 15:25
  • C* is designed more for running on larger boxes. It parallelizes work, which is hard on a 1-core box. The recommendation even for a minimal development environment is an m3.large (m3.xlarge or c3.2xlarge for production). EBS is also recommended against quite a bit. That said, it may still work well enough for some loads. When running below the minimal specs, it shouldn't be all that surprising to see things like high CPU. It has to thrash between compactions, requests, and cluster communication. You could try tuning compaction some, since you do seem behind on that. Is that what you're looking for? – Chris Lohfink Mar 19 '15 at 20:08
  • I'll try tweaking compaction settings, but that wouldn't explain why this started so suddenly. If that doesn't work, we'll have to try getting bigger instances. I'll let you know how that goes. – Vincent B. Mar 20 '15 at 10:25
  • Chris, I do struggle to see how the instance type would be that relevant in this case. If the instances have been stable until the data center was moved, and there is almost no load on the instances anyway (500MB of data), this seems like another issue. – Matthew O'Riordan Mar 20 '15 at 10:46
  • It could be; there are things that can trigger a lot of compactions, like an inconsistent node getting repaired, bootstrapping, removing nodes, etc. But if the nodes you have can't handle the additional load of a cluster change event, it would mean trouble eventually. – Chris Lohfink Mar 20 '15 at 16:35
  • I agree with your point about having bigger instances in the future to better cope with cluster changes; however, that doesn't address the point of finding out what was going wrong. Moving to a bigger instance wouldn't fix my underlying problem, it would only make the cluster look healthy for longer. – Vincent B. Mar 23 '15 at 14:17