
I've been running two identical medium CPU instances on Amazon behind a load balancer for a few months. I've noticed the load balancer has a habit of declaring an instance unhealthy on a fairly regular basis, after which the instance is taken down and replaced with a new instance of the defined AMI.

That's technically the correct thing to do; I just don't understand why it occasionally thinks the instance is unhealthy. I've been monitoring the health check ports over the last 3 days, and a check every 60 seconds against the public DNS of the two instances consistently succeeds. The load balancer has declared an instance unhealthy 3 times over that period and replaced it. The instances are purposefully massively overpowered for what I need, so I can rule out capacity as the issue.

With the ELB architecture, I know this doesn't technically matter, but the rate of unhealthies has gone from one per week to over one per day. Each instance spun up costs me an extra hour of instance cost. If this gets worse, the cost will become non-trivial, but more importantly it doesn't give me faith in the ELB internals.

This isn't the same question as this one; mine is an occasional failure. For information, I'm using the EU/Ireland data center and my unhealthy criterion is 10 failures on my port (8080) over a 5-minute period (which is longer than I'd really like to set anyway, since I don't want traffic going to instances that fail to respond for 5 minutes).
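To make the criterion above concrete, here is a minimal sketch of the arithmetic. It assumes the check runs at a fixed interval and that an instance is marked unhealthy after a given number of consecutive failures; the 30-second interval is inferred from "10 failures over 5 minutes" and is not stated in the configuration.

```python
def seconds_until_unhealthy(interval_s: float, unhealthy_threshold: int) -> float:
    """Approximate time for consecutive failed checks to reach the threshold."""
    return interval_s * unhealthy_threshold


# Assumed: 10 failures (from the question) at an inferred 30-second interval.
window = seconds_until_unhealthy(30, 10)
print(f"Instance would be marked unhealthy after about {window} seconds")

# A lower threshold shortens the window during which traffic hits a dead instance.
print(f"With a threshold of 4: about {seconds_until_unhealthy(30, 4)} seconds")
```

Note that a separate monitor polling every 60 seconds samples less often than a 30-second check would, so it could plausibly miss short failure bursts that the load balancer sees.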

I know someone is going to suggest contacting Amazon, but I don't have a support contract and anyone who's tried this knows the kind of answer I'll get, if I get one at all. I really like the idea of this thing, it just doesn't seem that stable to me.

Community
  • Do you use Auto Scaling? Additional instances might be started by certain conditions defined in the config. If you have the 'Auto Scaling Command Line Tools' installed, run `as-describe-auto-scaling-groups --headers` to list your Auto Scaling Groups. Pay attention to the last columns, such as MIN-SIZE, MAX-SIZE, and DESIRED-CAPACITY. – Roman Newaza Jan 12 '12 at 00:55
  • What are you polling for the health check, i.e. what is responding on port 8080? I've always just had a static file sitting there, so the health check is really just a check that the web server (and the server itself) is up and running. Also, how many requests are you getting through the ELB? It looks like there may be some known issues in very high traffic situations - https://forums.aws.amazon.com/thread.jspa?messageID=261530 – jaminto Jan 17 '12 at 02:45
  • Yes, we are polling for an empty file. As for the number of requests, sometimes it's 3000 a second – Roman Newaza Jun 06 '12 at 01:02
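For readers without the legacy command line tools, the Auto Scaling check suggested in the comments can be sketched in Python. This assumes the third-party `boto3` library and configured AWS credentials, neither of which appears in the original thread; `fetch_groups` and `summarize_groups` are hypothetical helper names. The `HealthCheckType` field is worth a look, since a group set to `ELB` replaces instances that the load balancer reports as unhealthy.

```python
def summarize_groups(response: dict) -> list:
    """Pull the size and health-check settings out of a describe response."""
    rows = []
    for group in response.get("AutoScalingGroups", []):
        rows.append((
            group["AutoScalingGroupName"],
            group["MinSize"],
            group["MaxSize"],
            group["DesiredCapacity"],
            group.get("HealthCheckType"),  # "EC2" or "ELB"
        ))
    return rows


def fetch_groups(region: str = "eu-west-1") -> dict:
    """Call the Auto Scaling API (assumes boto3 is installed and credentials are set)."""
    import boto3  # third-party; imported here so the summary helper works without it

    client = boto3.client("autoscaling", region_name=region)
    return client.describe_auto_scaling_groups()


# Example usage (requires AWS access):
# for row in summarize_groups(fetch_groups()):
#     print(row)
```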

1 Answer


The only reason for an instance to be in an unhealthy state is a failed health check. Make sure your application does not have load spikes: monitor its performance with third-party tools such as Nagios, Cacti, or Monit, and check the system during these spikes.
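If setting up one of those tools is too heavy, a small probe that records response time as well as success can show whether the health check target slows down during spikes. The following is a rough sketch using only the Python standard library; the function names, the 5-second timeout, and the 10-second interval are illustrative assumptions, not values the load balancer is known to use.

```python
import time
import urllib.error
import urllib.request


def probe(url: str, timeout_s: float = 5.0):
    """Fetch url once; return (ok, elapsed_seconds, detail)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = resp.status == 200
            detail = f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        ok, detail = False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, DNS failure, or timeout.
        ok, detail = False, f"error: {exc}"
    return ok, time.monotonic() - start, detail


def monitor(url: str, interval_s: float = 10.0, timeout_s: float = 5.0, count: int = 6):
    """Probe repeatedly and print one timestamped line per attempt."""
    results = []
    for _ in range(count):
        ok, elapsed, detail = probe(url, timeout_s)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} ok={ok} elapsed={elapsed:.3f}s {detail}")
        results.append((ok, elapsed))
        time.sleep(interval_s)
    return results
```

Calling `monitor` with the instance's health check URL on port 8080 and comparing the timestamps of slow or failed probes against the times the load balancer marked the instance unhealthy may show whether the failures line up with latency spikes.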

Paul Ma