
Recently, we faced downtime on our website. The two EC2 instances hosting the website sit behind an ELB. We received two alerts from the ELB saying that latency was high and that both nodes were unhealthy. Luckily, I was still able to SSH into the instances.

Running "ps aux --sort=-%cpu" on the instance showed that one process had taken up 97% of the CPU. I killed that process, the server passed the ELB health check, and the website was back up.
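For illustration, the same lookup can be sketched in a few lines of Python. This is just an illustration of the diagnosis step, not the exact procedure used in the incident, and it assumes a Linux host with a procps-style ps:

```python
# Sketch: find the highest-%CPU process, mirroring "ps aux --sort=-%cpu".
import subprocess

def top_cpu_process():
    lines = subprocess.run(
        ["ps", "aux", "--sort=-%cpu"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    fields = lines[0].split()                     # USER PID %CPU ... COMMAND
    # COMMAND may contain spaces, so split the data row at most len-1 times.
    values = lines[1].split(None, len(fields) - 1)
    return dict(zip(fields, values))

proc = top_cpu_process()
print(proc["PID"], proc["%CPU"], proc["COMMAND"])
```

Keep in mind that the %CPU column here is measured relative to a single core, which matters for comparing it against CloudWatch, as discussed below.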

The major concern is that AWS CloudWatch showed a maximum CPU utilization of only around 70%. As a result, it never triggered my alarm, which is set at 80%.

I understand that the CPU metrics reported by the ps and top commands will always differ somewhat from CloudWatch's because of virtualization. However, every other post I have read about this discusses top showing less CPU usage than CloudWatch; in my case it was the reverse, and I need help understanding why.

I also checked and found that there were no memory or network issues.

Can anyone help me here? What can I do so that I am alerted before things go wrong?

EDIT: Below are the CPU credit usage and balance charts.

CPU Credit usage

CPU credit balance

  • Collection interval for CloudWatch is 5 minutes; could it be that you looked at CloudWatch before it could collect the 90% usage? – kosa Apr 27 '17 at 13:56
  • Have you enabled detailed metrics on the instance? Also, what instance type are they? If t2, there are other factors at play. – Michael - sqlbot Apr 27 '17 at 23:39
  • @Nambari Detailed Monitoring is enabled, so CloudWatch reports the metrics every minute. Also, CloudWatch is only showing a maximum statistic of up to 70% after the 90% usage – Bhavya Keniya Apr 29 '17 at 10:58
  • @Michael-sqlbot Detailed metrics are enabled on the server and it is a t2.large server – Bhavya Keniya Apr 29 '17 at 10:59
  • Take a look at the CPU Credit Usage and CPU Credit Balance metrics graphs. In fact, if you can show them to us (edit into the question) so that they line up over time, that would be even better. I suspect you'll want to monitor all three values but will explain further, if you have some data that matches what I think may be the issue. – Michael - sqlbot Apr 29 '17 at 11:15
  • @Michael-sqlbot Thank you for pointing me to the CPU credit usage and balance charts. I have added my charts, and after reading http://stackoverflow.com/questions/28984106/whats-is-cpu-credit-balance-in-ec2 , the CPU credit usage chart makes a lot of sense. But I still don't know why the CPU usage stats only show up to 70% – Bhavya Keniya May 02 '17 at 10:22
  • Thank you for adding these. I ran a simulation over several hours this weekend based on your initial question, and was able to create a condition that took the instance's CPU %idle near 0% while showing only 15% CPU utilization in CloudWatch, but that was an artifact of me deliberately exhausting the credit balance to near 0 (which is why it took several hours) -- so that isn't your issue. However... – Michael - sqlbot May 02 '17 at 11:32
  • A t2.large is a dual-core machine -- so from inside the instance, the max available *CPU capacity* should be measured, not as 100%, but as 200%. If ps reports 97%, that almost certainly means *of one core*, which from the outside (CloudWatch) should be reported as ~48.5% utilization of total *machine* capacity. The credit utilization of a t2.large with both cores utilized at 100% would be 10 credits every 5 minutes (100% of 1 core for 1 minute costs 1 credit, and it's granular to microseconds). Yours is closer to 5, which tends to confirm the rest of the above. Your thoughts? – Michael - sqlbot May 02 '17 at 11:45
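The per-core vs. machine-wide arithmetic in the last comment can be checked with a short sketch. The 97% figure and the 2-vCPU count come from the thread; the helper names are mine, not any AWS API:

```python
# ps/top report %CPU relative to one core; CloudWatch reports
# utilization of the whole machine.
def machine_wide_pct(per_core_pct, num_cores):
    return per_core_pct / num_cores

# t2 credit accounting: 100% of one vCPU for one minute costs 1 credit.
def credits_per_5min(machine_pct, num_cores):
    return (machine_pct / 100.0) * num_cores * 5

cw_pct = machine_wide_pct(97, 2)      # one core pinned at 97% on a t2.large
burn = credits_per_5min(cw_pct, 2)    # expected credit burn over 5 minutes
print(cw_pct, burn)                   # 48.5 4.85
```

A machine-wide reading of ~48.5% and a burn rate of ~5 credits per 5 minutes are both consistent with the numbers observed in the charts, which supports the one-core explanation.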

0 Answers