
We're looking for the best way to deploy a small production Cassandra cluster (community) on EC2. For performance reasons, all recommendations are to avoid EBS.

But when we deploy the Datastax-provided AMI with ephemeral storage, the instance dies permanently whenever the ephemeral storage is wiped. A stop + start (done manually, or sometimes triggered by AWS for maintenance) renders the instance unusable. OpsCenter fails to fix the instance after a reboot, and the instance does not recover on its own.

I'd expect the instance to launch itself back up, run some script to detect that the ephemeral storage has been wiped, and sync with the cluster. Since it does not, the AMI looks appropriate only for dev tasks.

Can anyone please help us understand what the alternative is? We can live with a momentary loss of a node thanks to replication, but if the node never recovers and a new cluster is required, this looks like a dead end for a production environment.

  1. Is there a way to install Cassandra on EC2 so that it will recover from an ephemeral storage loss?

  2. If we buy a license for an enterprise edition will this problem go away?

  3. Does this mean that, in spite of poor performance, EBS (optimized) with PIOPS is the best way to run Cassandra on AWS?

  4. Is the recommendation to just avoid stopping + starting the instance and hope that AWS will not retire or reallocate their host machine? What is the recommendation in this case?

  5. What about an AWS rolling update? Upgrading one machine (killing it), starting it again, and then proceeding to the next machine will erase all cluster data, because each machine becomes responsive again before Cassandra on it has recovered. That way a rolling update can destroy a small (e.g. 3-node) cluster.

  6. Has anyone had good experience with paid services such as Instaclustr?
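To make the concern in point 5 concrete, here is a minimal sketch of the kind of guard a rolling update would need: it refuses to move on to the next machine until every node reports `UN` (Up/Normal). It assumes the text comes from `nodetool status` in its usual column layout; the function names `nodes_not_up_normal` and `safe_to_continue` are hypothetical, and the sketch only parses text that is handed to it.

```python
def nodes_not_up_normal(status_output):
    """Return (node_count, addresses of nodes whose state is not 'UN')."""
    count = 0
    bad = []
    for line in status_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        code = parts[0]
        # Node rows start with a two-letter code: U or D (up/down)
        # followed by N, L, J or M (normal/leaving/joining/moving).
        if len(code) == 2 and code[0] in "UD" and code[1] in "NLJM":
            count += 1
            if code != "UN":
                bad.append(parts[1])
    return count, bad


def safe_to_continue(status_output):
    """True only if at least one node row was seen and all rows are 'UN'."""
    count, bad = nodes_not_up_normal(status_output)
    return count > 0 and not bad
```

An update script would call `safe_to_continue` in a loop with a delay after restarting each machine, and only then touch the next one. An EC2 health check alone cannot tell it this, since the machine is reachable long before Cassandra has finished streaming data back.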

  • To clarify: Amazon often replaces a host instance for maintenance, and these events are sometimes caused by a malfunction. So a requirement for a production environment is that the environment as a whole survives a spontaneous migration of one instance. But the best scenario for recovering a single node that changes hosts is a replacement of the entire datacenter. This seems like overkill and does not sound like a very good deployment strategy. We're looking for a best practice that gives us the performance of ephemeral storage with automatic recovery from sporadic EC2 malfunctions. – Ron Bresler Aug 20 '15 at 11:38
  • Those are very good questions. I would like to hear somebody else's answers to them, but I can share what I know: 1. No. 2. No. 3. No; there were multiple studies on that, the last major one around 2013/2014, but maybe that has changed. 4. The guys at Netflix use Priam: https://github.com/Netflix/Priam. However, it has almost no documentation, and I haven't found any blog posts describing a successful installation. BTW, I've added one more related question (the 5th one). – piotrwest Aug 20 '15 at 21:55
  • Thanks, I'll check out Priam. I see it's also recommended [here](http://stackoverflow.com/questions/21386671/best-practice-cassandra-setup-on-ec2-with-large-amount-of-data) with a reference to a complete setup and instance upgrade flow [here](http://aryanet.com/blog/shrinking-the-cassandra-cluster-to-fewer-nodes). However, from what I read, it does not provide fully automatic node recovery. – Ron Bresler Aug 23 '15 at 09:31
  • @Ron, What did you end up doing for this? – runios Feb 06 '16 at 10:46
  • @runios, we wanted to use Cassandra for a very specific purpose. After looking at the deployment we would need, we decided it was overkill and went with a different solution. – Ron Bresler Jul 14 '16 at 12:44

1 Answer


New docs from Datastax actually indicate that EBS-optimized instances backed by GP2 SSD volumes can be used for production workloads. With EBS-backed storage you can easily take snapshots, which virtually eliminates the chance of data loss on a node, and a node can be migrated to a new host with a simple stop/start.

With ephemeral storage, you basically have to plan around failure. Consider, for example, what happens if your entire cluster is in a single region (SimpleSnitch) and that region goes down.
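As one illustration of "planning around failure", here is a minimal sketch of a boot-time check that detects a wiped data directory and, in that case, produces the JVM option for rejoining the cluster as a replacement of the node's former self. It is a sketch under stated assumptions, not a tested recovery procedure: the helper names, the `/var/lib/cassandra/data` path and the IP address are all illustrative, and whether `-Dcassandra.replace_address` is the right flag should be checked against the docs for your Cassandra version.

```python
import os


def data_dir_is_wiped(data_dir):
    """Return True if the data directory is missing or has no entries."""
    if not os.path.isdir(data_dir):
        return True
    return len(os.listdir(data_dir)) == 0


def startup_jvm_opts(data_dir, node_ip):
    """Extra JVM options for this boot.

    Empty when the data survived; a replace_address option when the
    ephemeral disk came back empty (assumed flag, verify per version).
    """
    if data_dir_is_wiped(data_dir):
        return ["-Dcassandra.replace_address=" + node_ip]
    return []


if __name__ == "__main__":
    # Both values below are assumptions for illustration only.
    print(" ".join(startup_jvm_opts("/var/lib/cassandra/data", "10.0.0.12")))
```

A startup wrapper could append the returned options to the JVM options it passes to Cassandra before launching the service. The option is only needed on the first boot after the wipe; once the data directory has been repopulated, the check returns an empty list.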

http://docs.datastax.com/en/cassandra/3.x/cassandra/planning/planPlanningEC2.html

Louis T.