
Setup:
We have a 3-node Cassandra cluster with around 850G of data on each node. The Cassandra data directory is on an LVM volume (currently made up of 3 drives: 800G + 100G + 100G), and there is a separate, non-LVM volume for cassandra_logs.
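For context, the third volume was added with the usual LVM grow workflow, roughly like the sketch below (device, volume-group, and logical-volume names are placeholders, not our exact ones):

# make the new 100G disk an LVM physical volume
pvcreate /dev/sdd
# add it to the volume group backing the Cassandra data directory
vgextend vg_cassandra /dev/sdd
# grow the logical volume and the filesystem on it (-r resizes the FS as well)
lvextend -r -l +100%FREE /dev/vg_cassandra/lv_data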

Versions:
Cassandra v2.0.14.425
DSE v4.6.6-1

Issue:
After adding the 3rd (100G) volume to the LVM on each node, all the nodes show very high disk I/O and go down quite often. The servers also become inaccessible and have to be rebooted; they don't stabilise, and we need to reboot roughly every 10 - 15 minutes.
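A rough way to check whether the kernel is reporting stalled I/O or OOM-killer activity when a node goes unreachable (generic commands, not a transcript from our nodes):

# look for stalled I/O and OOM-killer messages in the kernel log
dmesg | grep -iE 'blocked for more than|hung_task|out of memory' | tail -n 20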

Other Info:
We have the DSE-recommended OS settings (vm.max_map_count, file descriptor limits) configured on all nodes; a sketch of typical values follows after this list.
RAM on each node : 24G
CPU on each node : 6 cores / 2600MHz
Disk on each node : 1000G (Data dir) / 8G (Logs)
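For completeness, this is roughly what those recommended settings look like (the values shown are the ones commonly cited in the DataStax install documentation and are illustrative; check the guide for your exact version):

# /etc/sysctl.conf - allow Cassandra's many memory-mapped SSTable segments
vm.max_map_count = 1048575

# /etc/security/limits.conf - raise resource limits for the cassandra user
cassandra - memlock unlimited
cassandra - nofile  100000
cassandra - nproc   32768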

Fawad
  • What operations are you doing on Cassandra? – Abhinandan Satpute Apr 07 '16 at 17:54
  • Mostly write operations. We have Cassandra running with Solr, so we index the data we want to read and read it from the Solr indexes. – Fawad Apr 07 '16 at 20:32
  • Is there a particular state in which the servers stop reacting? – ramo Apr 07 '16 at 20:34
  • Whenever I start the DSE service it starts compaction on one of the biggest keyspaces, which leads to high disk I/O, and later the node goes down. – Fawad Apr 07 '16 at 20:46
  • What kind of disks are these? By the sounds of it you are running out of disk bandwidth. – Patrick McFadin Apr 07 '16 at 21:04
  • Our Cassandra nodes are VMs, and the disks on the hypervisors are layered like this: NL-SAS + SSD (for write caching) -> Ceph -> VM – Fawad Apr 07 '16 at 21:32
  • @PatrickMcFadin I have been running this cluster on these disks for 4 months with more or less the same amount of data. The issue only started 2 days ago, after extending the LVM with an additional 100G volume on each node. Can LVM be the cause of this issue? – Fawad Apr 08 '16 at 00:53
  • I suspect you are going past your throughput limit, which is driving up your I/O wait times. Can you paste the output of a tpstats on an affected node? – Patrick McFadin Apr 08 '16 at 00:57
  • @PatrickMcFadin We have this issue on all 3 nodes of the cluster, but the output I am pasting here is from the node that is most affected: http://pastebin.com/Pd9EpzQX – Fawad Apr 08 '16 at 01:27
  • Other nodes: Node02: http://pastebin.com/ZkYQ9N98, Node03: http://pastebin.com/jf8S29bh – Fawad Apr 08 '16 at 01:29
  • OK, I see your problem, or rather, the reason your node is dying. I'll answer in the answer section. – Patrick McFadin Apr 08 '16 at 01:52
  • @PatrickMcFadin The issue does NOT occur right after starting the nodes. They work normally for 15 minutes to an hour, and then go down within minutes. Shouldn't they show the behaviour you're describing below from the very beginning, i.e. as soon as the DSE node starts? – ramo Apr 08 '16 at 06:51
  • No, because a flush only happens after writes occur on the database. Compactions can start when the nodes start, but a flush only happens after a node has been online for a bit. – Patrick McFadin Apr 08 '16 at 06:53
  • @PatrickMcFadin I've monitored the logs of those nodes, and what comes up shortly before an instance "dies" is this: http://pastebin.com/xgxAE8iN Would it be a solution to a) increase the 600000 millis to something else, or b) start C* without Solr enabled first and, once flush/compaction is done, restart the nodes with Solr activated again? – ramo Apr 08 '16 at 12:28
  • That line, "Timeout while waiting for workers when flushing pool Index", means Solr is also backing up because the disk has stopped responding. Increasing the timeout will only create more back pressure. All of your processes are starving for disk time and not getting it. Really, your only short-term solution is adding more nodes to spread out the load, and then working towards real disks on the nodes. – Patrick McFadin Apr 08 '16 at 19:19

1 Answer


As I suspected, you are having throughput problems on your disk. Here's what I looked at to give you background. The nodetool tpstats output from your three nodes had these lines:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
FlushWriter                       0         0             22         0                 8
FlushWriter                       0         0             80         0                 6
FlushWriter                       0         0             38         0                 9 

The column I'm concerned about is All Time Blocked. As a ratio to Completed, you have a lot of blocking. The FlushWriter pool is responsible for flushing memtables to disk to keep the JVM from running out of memory or creating massive GC problems. A memtable is an in-memory representation of your tables. As your nodes take more writes, memtables start to fill and need to be flushed. That operation is a long sequential write to disk. Bookmark that. I'll come back to it.
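If you want to keep an eye on this yourself, the blocked-to-completed ratio can be pulled straight out of tpstats; a simple sketch (pool names can differ slightly between versions):

# show just the header and the FlushWriter line, refreshed every 30 seconds
watch -n 30 "nodetool tpstats | grep -E 'Pool Name|FlushWriter'"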

When the flush writers are blocked, the heap starts to fill. If they stay blocked, requests start to queue up and eventually the node will OOM.
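One way to watch this spiral in real time is to follow GC activity in the Cassandra log while FlushWriter stays blocked (the log path below assumes a default DSE package install; adjust for your layout):

# long GC pauses show up as GCInspector lines shortly before the node falls over
tail -f /var/log/cassandra/system.log | grep -i GCInspector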

Compaction might be running as well. Compaction is a long sequential read of SSTables into memory followed by a long sequential flush of the merge-sorted results. More sequential IO.
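You can see what compaction is doing, and temporarily cap how much disk bandwidth it takes, with nodetool (a stop-gap sketch only; throttling compaction trades backlog for disk relief and does not fix the underlying throughput problem):

# list compactions currently in progress and the pending backlog
nodetool compactionstats
# temporarily cap compaction throughput on this node to 16 MB/s
nodetool setcompactionthroughput 16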

So all of these operations on disk are sequential, not random IOPS. If your disk can't handle simultaneous sequential reads and writes, IOWait shoots up, requests get blocked, and then Cassandra has a really bad day.
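You can watch the starvation directly with iostat from the sysstat package; on a node in this state you would expect %iowait to be high and the data-directory device to sit near 100% utilization with climbing await (output columns vary slightly by sysstat version):

# extended device stats every 5 seconds; look at %iowait, await and %util
iostat -x 5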

You mentioned you are using Ceph. I haven't seen a successful deployment of Cassandra on Ceph yet. It will hold up for a while and then tip over under sequential load. Your easiest short-term solution is to add more nodes to spread out the load. The medium-term fix is to find ways to optimize your stack for sequential disk loads, but that will eventually fail too. The long-term fix is to get your data onto real disks and off shared storage.

I have told consulting clients this for years when using Cassandra: "If your storage has an Ethernet plug, you are doing it wrong." It's a good rule of thumb.

Patrick McFadin