1

I am looking for a system to keep track of multiple AWS accounts: 130+ accounts, each containing 200+ servers.
I want to know how to track machine failures, service failures, and similar events.
I also want to know how to automatically bring up a replacement machine if the underlying hardware fails or a Spot instance is terminated.
I'm open to all solutions, including Chef/Terraform automation, healing scripts, etc.

You guys will be saving me a lot of sleepless nights :)

Thanks in advance!!

  • It's a broad question and you might consider offerings from AWS Partner services as well. Have you explored any options? – sudo Feb 03 '18 at 22:26

2 Answers

2

This is purely my take on how to approach your problem.

1) To manage and keep track of multiple AWS accounts, you can use AWS Organizations. It lets you centrally manage all 130+ member accounts from a single management account (formerly called the master account). You can enable consolidated billing as well.
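As a rough illustration of the central view Organizations gives you, the sketch below lists the active member accounts. It assumes boto3 is installed and that the client is created with credentials for the management account; the helper name `list_active_accounts` is made up for this example.

```python
def list_active_accounts(org_client):
    """Return (account_id, name) pairs for every ACTIVE account in the organization.

    org_client is assumed to be boto3.client("organizations"), created with
    credentials for the organization's management account.
    """
    accounts = []
    # ListAccounts is paginated, so walk every page rather than making one call.
    paginator = org_client.get_paginator("list_accounts")
    for page in paginator.paginate():
        for account in page["Accounts"]:
            if account["Status"] == "ACTIVE":
                accounts.append((account["Id"], account["Name"]))
    return accounts
```

You would call it as `list_active_accounts(boto3.client("organizations"))` and then loop over the result, for example to assume a role in each account and run the same checks everywhere.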

2) Tracking failures will need to be customized to your requirements. For example, you can build a microservice on Docker containers or ECS whose sole purpose is to keep track of failures, generate a report, and push it to S3 on a daily basis. You can then build a dashboard in Amazon QuickSight from these reports in S3.

Another microservice could remediate the failures. It depends on how exhaustive and fine-grained you want your implementation to be.
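A minimal sketch of what the reporting part of such a service might do, assuming boto3 clients for EC2 and S3 are passed in. The function names, the report fields, and the `failure-reports/` key prefix are all hypothetical; only instances with an `impaired` status check are treated as failures here, which is a simplification.

```python
import json
from datetime import date


def summarize_failures(instance_statuses):
    """Keep only instances with an 'impaired' system or instance status check."""
    failures = []
    for status in instance_statuses:
        system = status["SystemStatus"]["Status"]
        instance = status["InstanceStatus"]["Status"]
        if "impaired" in (system, instance):
            failures.append({
                "instance_id": status["InstanceId"],
                "state": status["InstanceState"]["Name"],
                "system_status": system,
                "instance_status": instance,
            })
    return failures


def publish_daily_report(ec2_client, s3_client, bucket):
    """Collect status checks for one account/region and upload a JSON report to S3."""
    statuses = []
    paginator = ec2_client.get_paginator("describe_instance_status")
    # IncludeAllInstances=True also returns instances that are not running.
    for page in paginator.paginate(IncludeAllInstances=True):
        statuses.extend(page["InstanceStatuses"])
    report = summarize_failures(statuses)
    key = f"failure-reports/{date.today().isoformat()}.json"  # hypothetical layout
    s3_client.put_object(Bucket=bucket, Key=key, Body=json.dumps(report).encode("utf-8"))
    return key
```

Run on a schedule per account and region, this would leave one JSON file per day in the bucket for a dashboard to read.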

3) Launching replacement instances when Spot instances are terminated can be achieved with a simple Auto Scaling configuration. Here are some articles that should give you some ideas:

Using Spot Instances with On-Demand instances

Optimizing Spot Fleet+Docker with High Availability
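To make the Auto Scaling idea concrete, here is one possible sketch, not taken from the linked articles: an Auto Scaling group with a mixed instances policy keeps a fixed number of instances running and launches replacements when Spot capacity is reclaimed. The helper name, the choice of one On-Demand base instance, and the `capacity-optimized` strategy are assumptions for illustration; the launch template is assumed to exist already.

```python
def mixed_spot_asg_request(name, launch_template_name, subnet_ids, instance_types, size):
    """Build keyword arguments for the Auto Scaling CreateAutoScalingGroup call."""
    return {
        "AutoScalingGroupName": name,
        # Fixed size: the group replaces any instance that is terminated.
        "MinSize": size,
        "MaxSize": size,
        "DesiredCapacity": size,
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template_name,
                    "Version": "$Latest",
                },
                # Several instance types give the group more Spot pools to draw from.
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                "OnDemandBaseCapacity": 1,  # assumption: keep one On-Demand instance
                "OnDemandPercentageAboveBaseCapacity": 0,  # everything else on Spot
                "SpotAllocationStrategy": "capacity-optimized",
            },
        },
    }
```

With boto3 installed, the request could be sent with `boto3.client("autoscaling").create_auto_scaling_group(**request)`.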

Madhukar Mohanraju
0

AWS Organizations is useful for management. You can also look into a multi-account billing strategy and security strategy. A shared services account that holds your IAM users will make things easier.

For tracking failures, you can set up automatic instance recovery using CloudWatch alarms. CloudWatch alarms can also email you when something unexpected happens, though setting them up individually could be time-consuming. At your scale, I think you should look into third-party tools.
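For reference, a recovery alarm can be created per instance with a small script, which helps with the "setting them up individually" problem. This is a sketch under assumptions: the helper name, alarm name, and the two-minute evaluation window are my own choices, and the recover action is only supported on some instance types, so check that yours qualify.

```python
def recovery_alarm_request(instance_id, region):
    """Build keyword arguments for the CloudWatch PutMetricAlarm call."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        # The system status check fails when the underlying host has a problem.
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,  # assumption: act after two failed minutes
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in action that moves the instance to healthy hardware.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }
```

With boto3 installed, you could loop over your instance IDs and call `boto3.client("cloudwatch", region_name=region).put_metric_alarm(**recovery_alarm_request(instance_id, region))` for each one.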

Tim