Recently we had downtime on our website. The two EC2 instances hosting it sit behind an ELB. We received two alerts from the ELB: latency was high and both nodes were unhealthy. Luckily, I was still able to SSH into the instances.
Command "ps aux --sort=-%cpu" on the ec2 showed that one of the process had taken up 97% of the CPU. I killed that process and the server passed the health check of the ELB and the website was up.
The major concern is that AWS CloudWatch showed a maximum CPU usage of only around 70% during the incident, so it never triggered my alarm, which is set at an 80% threshold.
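For context, the alarm is configured roughly like this (a reconstruction; the alarm name, instance ID, and SNS topic ARN are placeholders):

```bash
# Alarm on EC2 CPU: fires when the Maximum CPUUtilization datapoint
# in a 5-minute period reaches 80% (identifiers below are placeholders)
aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 80 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
```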
I understand that the CPU figures from ps and top will always differ somewhat from CloudWatch's, since CloudWatch measures utilization from the hypervisor's perspective while ps/top see the guest OS view. But every post I have found about this discrepancy describes top showing *less* CPU usage than CloudWatch. In my case it was the reverse, and I need some help understanding why.
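One factor I want to rule out: with basic monitoring, CloudWatch records one CPUUtilization datapoint per five minutes, and that datapoint is itself an average over the interval, so the "Maximum" statistic is a maximum of five-minute averages and a brief spike gets diluted. A sketch of how to check at finer granularity (this requires detailed monitoring to be enabled; the instance ID and time window are placeholders):

```bash
# Pull 1-minute Maximum CPUUtilization datapoints around the incident
# to compare against what top/ps reported (placeholders: instance ID, times)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistics Maximum \
  --period 60 \
  --start-time 2017-06-01T10:00:00Z \
  --end-time 2017-06-01T11:00:00Z
```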
I also checked and found no memory or network issues.
Can anyone help me here? What can I do so that I am alerted before things go wrong?
EDIT: Below are the CPU credit usage and balance charts.
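Since the credit charts suggest these are burstable (t2) instances, I am also considering an additional alarm on the credit balance so throttling is flagged before it bites. A sketch (the alarm name, threshold, instance ID, and SNS topic are all placeholders):

```bash
# Sketch: alert when the CPU credit balance runs low on a t2 instance
# (identifiers and threshold below are placeholders, not my real config)
aws cloudwatch put-metric-alarm \
  --alarm-name low-cpu-credits \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 20 \
  --comparison-operator LessThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
```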