
I'm running DataStax Enterprise in a cluster consisting of 3 nodes. They are all running on the same hardware: 2-core Intel Xeon 2.2 GHz, 7 GB RAM, 4 TB RAID-0.

This should be enough for running a cluster with a light load, storing less than 1 GB of data.

Most of the time everything is just fine, but it appears that the running tasks related to the Repair Service in OpsCenter sometimes get stuck; this causes instability on that node and an increase in load.

However, if the node is restarted, the stuck tasks disappear and the load returns to normal.

Because we don't have much data in our cluster, we're using the min_repair_time parameter defined in opscenterd.conf to delay the Repair Service so that it doesn't complete too often.
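For context, this is roughly what that looks like in opscenterd.conf; the [repair_service] section and min_repair_time key follow the OpsCenter docs, but the value here is just an example, not necessarily what we use:

    [repair_service]
    # minimum time (in seconds) a full repair cycle is allowed to take,
    # so the service doesn't immediately start over on a near-empty cluster
    min_repair_time = 864000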

It really seems a bit weird that tasks which are marked as "Complete" and show a progress of 100% don't go away. Yes, we've waited hours for them to disappear, but they won't; the only way we've found to solve this is to restart the nodes.

[Screenshot: nodes with running tasks]

[Screenshot: running tasks]

Edit:

Here's the output from nodetool compactionstats

[Screenshot: nodetool compactionstats output]
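Since the screenshot doesn't carry over here, these are the commands used to inspect the compactions (the watch interval is just an example):

    # show currently running compactions and pending tasks
    nodetool compactionstats
    # re-check every 10 seconds to see whether progress moves at all
    watch -n 10 nodetool compactionstats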

Edit 2:

I'm running DataStax Enterprise 4.6.0 with Cassandra 2.0.11.83.

Edit 3:

This is the output from dstat on a node that is behaving normally:

[Screenshot: dstat output from a node behaving normally]

This is the output from dstat on a node with a stuck compaction:

[Screenshot: dstat output from a node with a stuck compaction]

Edit 4:

Output from iostat on a node with a stuck compaction; note the high iowait:

[Screenshot: iostat output showing high iowait]
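For anyone reproducing the diagnostics above, these are the kinds of commands behind the screenshots (flags and intervals are just examples):

    # per-device utilization, await and %iowait, refreshed every 5 seconds
    iostat -x 5
    # combined CPU / disk / net / paging / system view, refreshed every second
    dstat -cdngy 1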

parek
  • two comments: 1) what does DataStax support say? 2) 7 GB of RAM seems like not very much – Vorsprung Jan 30 '15 at 13:23
  • I've been running Cassandra on worse specs than that before without any trouble. I don't think that's what's causing the hang – parek Jan 30 '15 at 13:42
  • Do you see the same in nodetool compactionstats and compactionhistory? – phact Jan 30 '15 at 17:25
  • You can see it yourself in the new edit. It appears that the tasks show up when the compactionstats command is executed as well. – parek Jan 31 '15 at 10:18
  • What version of OpsCenter and DSE are you using, please? – markc Feb 03 '15 at 14:01
  • DSE 4.6.0 Cassandra 2.0.11.83 – parek Feb 03 '15 at 14:23
  • Which of your subsystems is responsible for the increase in load? Is it CPU or disk? If CPU, is it iowait, user, steal? – phact Feb 04 '15 at 07:48
  • Also is it always the same table being compacted? OpsCenter roll ups? – phact Feb 04 '15 at 07:51
  • Always the same table being compacted, the OpsCenter rollups60. I have also tried to truncate it, but it ends up stuck in compaction after a while. – parek Feb 05 '15 at 08:14
  • As a matter of fact, I think what's getting hurt the most is the disk. From when the tasks get stuck I can see a constant 50% disk utilization. It's also impossible to restart the machines using "sudo reboot" or "sudo reboot -f". I'm running the VMs in Microsoft Azure, so the only way to restart them is through the Azure Management Portal. – parek Feb 05 '15 at 08:20
  • Each node has 4 disks of 1 TB each. Could this be an issue? Using RAID-0 on each node, the partition where the Cassandra data is stored is 4 TB. Something is clearly wrong given the high disk utilization. – parek Feb 05 '15 at 08:21
  • Added a suggestion under answers based on the info we have so far. Can you add dstat output from when this happens? – phact Feb 05 '15 at 22:46
  • dstat output added; it appears that the CPU is behaving weirdly on the nodes stuck in compaction – parek Feb 10 '15 at 10:00
  • Also added output from iostat; as you can see, there's a really high iowait – parek Feb 10 '15 at 10:25
  • High iowait usually means your disks aren't keeping up with your workload. Are these rotating platters? How old? Do you have other storage to test with? Are your commit log and data dirs on separate drives? – phact Feb 10 '15 at 13:36
  • You're running on azure. I remember there being an issue with io and the number of azure accounts. Let me find this and get back. – phact Feb 10 '15 at 13:38
  • I see, waiting for your resp. – parek Feb 10 '15 at 13:39

4 Answers


Azure storage

Azure divides disk resources among storage accounts under an individual user account. There can be many storage accounts in an individual user account.

For the purposes of running DSE [or Cassandra], it is important to note that a single storage account should not be shared between more than two nodes if DSE [or Cassandra] is configured like the examples in the scripts in this document. This document configures each node to have 16 disks. Each disk has a limit of 500 IOPS. This yields 8,000 IOPS when configured in RAID-0. So, two nodes will hit 16,000 IOPS and three would exceed the limit.
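To make the arithmetic in that quote concrete (the 500 IOPS per disk figure comes from the quoted document; the 4-disk case matches the setup described in the comments above):

    16 disks x 500 IOPS = 8,000 IOPS per node  -> 2 nodes = 16,000 IOPS (at the limit)
                                                  3 nodes = 24,000 IOPS (over the limit)
     4 disks x 500 IOPS = 2,000 IOPS per node  -> well under the limit with one storage
                                                  account per node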

See details here

phact
  • Interesting. I can however confirm that I'm using one Azure Storage Account per node. I'm currently using RAID-0 across 4 different disks per node, all on the same Storage Account for that specific node. I should not be near the limit of 16,000 IOPS; since I'm running 4 disks I should be somewhere around 2k IOPS. However, this must be an Azure-related issue. – parek Feb 10 '15 at 13:48
  • If you bring up new nodes do they have the same issue? Do all of your nodes have this problem? – phact Feb 10 '15 at 21:56
  • All nodes have the same problem. – parek Feb 10 '15 at 22:01

So, this issue has been under investigation for a long time now and we've found a solution. However, we aren't sure what the underlying problem causing the issues was; we got a clue, but nothing could be confirmed.

Basically, what we had done was set up a RAID-0 (also known as striping) consisting of four disks, each 1 TB in size. We should have seen roughly 4x the IOPS of a single disk when using the stripe, but we didn't, so something was clearly wrong with the RAID setup.

We used multiple utilities to confirm that the CPU was waiting for IO to respond most of the time whenever we considered the node "stuck". Clearly something with the IO, most probably our RAID setup, was causing this. We tried a few different mdadm settings, etc., but didn't manage to solve the problems with the RAID setup (a typical layout is sketched below).
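For reference, a minimal sketch of the kind of RAID-0 array we were building; the device names, chunk size, and mount point are examples, not our exact values:

    # create a 4-disk RAID-0 (striped) array
    sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
        --chunk=256 /dev/sdc /dev/sdd /dev/sde /dev/sdf
    # put a filesystem on it and mount it as the Cassandra data directory
    sudo mkfs.ext4 /dev/md0
    sudo mount /dev/md0 /var/lib/cassandra
    # inspect stripe layout and health
    cat /proc/mdstat
    sudo mdadm --detail /dev/md0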

We started investigating Azure Premium Storage (which is still in preview). This allows attaching disks to VMs whose underlying physical storage is actually SSDs. So we said: well, SSDs => more IOPS, so let's give this a try. We did not set up any RAID with the SSDs; we are only using a single SSD disk per VM.
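A simple way to sanity-check the raw random-write IOPS of a volume before and after such a switch is something like fio; the file path, size, and queue depth below are example parameters, not what we measured with:

    # random 4k writes against the data volume, bypassing the page cache
    sudo fio --name=randwrite --filename=/var/lib/cassandra/fio-test \
        --rw=randwrite --bs=4k --size=1G --direct=1 \
        --ioengine=libaio --iodepth=32 --numjobs=4 --group_reporting
    # clean up the test file afterwards
    sudo rm /var/lib/cassandra/fio-test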

We've been running the cluster for almost 3 days now and have stress tested it a lot, but we haven't been able to reproduce the issues.

I guess we never got down to the real cause, but the conclusion is that one of the following must have been the underlying cause of our problems.

  • Disks that were too slow (the write load exceeded the available IOPS)
  • The RAID was set up incorrectly, which caused the disks to behave abnormally

These two problems go hand in hand, and most likely we had simply set up the disks the wrong way. However, SSDs = more power to the people, so we will definitely continue using SSDs.

If anyone experiences the same problems that we had on Azure with RAID-0 on large disks, don't hesitate to add to this thread.

parek
  • We are having a similar problem. Did you see tombstone warnings when you looked at your system.log? – Sid Aug 04 '15 at 18:51

Part of the problem you have is that you do not have a lot of memory on those systems, and it is likely that even with only 1 GB of data per node, your nodes are experiencing GC pressure. Check the system.log for errors and warnings, as this will provide clues as to what is happening on your cluster.
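A quick way to check for that (the log path is the default for package installs; adjust for your setup):

    # GC pauses logged by Cassandra's GCInspector
    grep -i "GCInspector" /var/log/cassandra/system.log | tail -20
    # any other warnings or errors
    grep -E "WARN|ERROR" /var/log/cassandra/system.log | tail -50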

Erick Ramirez
  • Saw nothing weird; I don't see why nodes with almost no load wouldn't behave normally with 7 GB of RAM – parek Feb 10 '15 at 10:01
  • Please see the edits in the question; this is very likely some kind of IO-related issue – parek Feb 10 '15 at 10:34

The rollups60 table in the OpsCenter keyspace contains the lowest-granularity (minute-level) time series data for all your Cassandra, OS, and DSE metrics. These metrics are collected regardless of whether you have built charts for them in your dashboard, so that you can pick up historical views when needed. It may be that this table is outgrowing your small hardware.
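One way to check how big those rollup tables have become and whether they dominate compaction (table names assume the default OpsCenter schema; on some versions you may need to run cfstats without arguments and search the output):

    # on-disk size, SSTable count and pending compactions for the minute-level rollups
    nodetool cfstats OpsCenter.rollups60
    # what is currently compacting
    nodetool compactionstats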

You can try tuning OpsCenter to avoid this kind of issue. Here are some options for configuration in your opscenterd.conf file (a sketch follows the list):

  1. Add keyspaces (for example the opsc keyspace) to your ignored_keyspaces setting
  2. You can also decrease the TTL on this table by tuning the 1min_ttl setting
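A minimal sketch of what those settings might look like; the section names follow the OpsCenter configuration docs, and depending on your OpsCenter version they may belong in the per-cluster configuration file rather than opscenterd.conf itself:

    [cassandra_metrics]
    # keep minute-level rollups for one day instead of the default
    1min_ttl = 86400

    [cassandra]
    # skip metric collection for keyspaces you don't need charts for
    ignored_keyspaces = system, system_traces, OpsCenter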

Sources: OpsCenter Configuration (DataStax docs), Metrics Configuration (DataStax docs)

phact
  • These are actually really interesting settings; honestly, I think they can be used to tune related issues. However, something is clearly wrong with my setup, and these settings alone won't solve the problem, even though they will probably help with tuning. – parek Feb 10 '15 at 10:03
  • Could be that your disks are shot – phact Feb 10 '15 at 13:36