25

I have a cloud of server instances running at Amazon, using their load balancer to distribute the traffic. Now I am looking for a good way to gracefully scale the network down without causing connection errors on the browser's side.

As far as I know, any connections to an instance will be rudely terminated when it is removed from the load balancer.

I would like to have a way to inform my instance, say, one minute before it gets shut down, or to have the load balancer stop sending traffic to the dying instance without terminating its existing connections.

My app is node.js based, running on Ubuntu. I also have some special software running on it, so I prefer not to use one of the many PaaS offerings that host node.js.

Thanks for any hints.

  • Are you using ELB to maintain user sessions that are only valid on specific EC2 instances? And if so, how long do those sessions last? – Ray Vahey Oct 10 '11 at 17:42
  • I don't use ELB for user session management - maybe I will do so for performance reasons only, but I do not rely on this feature. Session management is being done by a central database that all nodes have access to. – Johann Philipp Strathausen Oct 11 '11 at 06:39
  • 6
    Here's the thread about ELB rudely dropping live connections when an instance is removed: https://forums.aws.amazon.com/thread.jspa?threadID=61278 Amazon asked for feedback, so feel free to add your +1 for fixing this. – Eric Hammond Jan 04 '12 at 23:36

6 Answers

18

I know this is an old question, but it should be noted that Amazon has recently added support for connection draining. This means that when an instance is removed from the load balancer, it will complete any requests that were in progress before it was removed, and no new requests will be routed to it. You can also supply a timeout, meaning any requests that run longer than the timeout window will be terminated after all.

To enable this behaviour, go to the Instances tab of your load balancer and change the Connection Draining behaviour.
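
If you'd rather script it, the same setting can be applied through the AWS CLI; a sketch, where the load balancer name and the 300-second timeout are placeholders to adjust:

aws elb modify-load-balancer-attributes \
    --load-balancer-name my-load-balancer \
    --load-balancer-attributes '{"ConnectionDraining":{"Enabled":true,"Timeout":300}}'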

Jaap Haagmans
  • 6,232
  • 1
  • 25
  • 30
16

This idea uses the ELB's capability to detect an unhealthy node and remove it from the pool, BUT it relies upon the ELB behaving as expected under the assumptions below. This is something I've been meaning to test for myself but haven't had the time yet; I'll update the answer when I do.

Process Overview

The following logic could be wrapped and run at the time the node needs to be shut down; a bash sketch of such a wrapper follows the list.

  1. Block new HTTP connections to nodeX but continue to allow existing connections
  2. Wait for existing connections to drain, either by monitoring existing connections to your application or by allowing a "safe" amount of time.
  3. Initiate a shutdown of the nodeX EC2 instance using the EC2 API directly or abstracted scripts.

"safe" according to your application, which may not be possible to determine for some applications.

Assumptions that need to be tested

We know that ELB removes unhealthy instances from its pool. I would expect this to be done gracefully, so that:

  1. A new connection to a recently closed port will be gracefully redirected to the next node in the pool
  2. When a node is marked Bad, the already established connections to that node are unaffected.

Possible test cases:

  • Fire HTTP connections at the ELB (e.g. from a curl script), logging the results while scripting the opening and closing of one of the node's HTTP ports (see the loop sketched after this list). You would need to experiment to find an acceptable amount of time that allows the ELB to reliably detect the state change.
  • Maintain a long HTTP session (e.g. a file download) while blocking new HTTP connections; the long session should hopefully continue.
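
For the first test case, a hypothetical curl loop like this would do, logging one status code per second while you toggle the node's port (the ELB URL is a placeholder):

# Log status code and total time for each request against the ELB
while true; do
    curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" http://my-elb.example.com/
    sleep 1
done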

1. How to block HTTP Connections

Use a local firewall on nodeX to block new sessions but continue to allow established sessions.

For example, using iptables:

iptables -A INPUT -j DROP -p tcp --syn --destination-port <web service port>
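
Should the shutdown be aborted, the same rule specification with -D removes the block again:

iptables -D INPUT -j DROP -p tcp --syn --destination-port <web service port>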
Ray Vahey
  • 3,065
  • 1
  • 24
  • 25
  • Thanks for the ideas! Unfortunately, assumption number 2 seems to be the important one, and it is the one that is missing. As far as I know, a node exits the pool about 40-60 seconds after being detected as ill, with no guarantee. But sadly, it is removed immediately without any warning from the ELB, and any existing connections are terminated rather than forwarded to another node. This is what I know, but I could try to experiment with it... – Johann Philipp Strathausen Oct 11 '11 at 18:57
  • It's good that it detects the node as down and removes it; that's what we want. But removing the existing connections as well would certainly give us problems. I wouldn't rule this out without a test, because I've seen other load-balancing software work this way... Otherwise, are you able to use subdomains with the load balancer so that it only establishes the initial connection? E.g. balance.domain.com diverts to nodeX.domain.com, where nodeX is the next one in a round-robin pool etc. – Ray Vahey Oct 12 '11 at 04:29
  • ELB itself doesn't support using sub-domains - but a machine could know its own name. I could even have a set of machines mapped to domain names via DNS entries - I don't know how to do that automatically though. Since I pay most of the money for running instances, and paused instances are pretty cheap, this may be an option. So I'd use the ELB for the initial distribution, and from then on maybe use the node a user has been assigned to. This may work! Any idea on how to best use subdomains instead of AWS machine URLs? (I want to use wildcard SSL for a single domain). – Johann Philipp Strathausen Oct 14 '11 at 10:04
  • hmm, this wouldn't work with a wildcard cert because the cert must be tied to a single static IP. You'd need individual certs for each node. – Ray Vahey Oct 14 '11 at 11:08
  • I kind of like the general idea though. I gave you the green checkmark because of that :-) – Johann Philipp Strathausen Oct 16 '11 at 15:55
  • By the way, I don't think the cert has to be tied to a static IP, but to the domain. No? – Johann Philipp Strathausen Oct 18 '11 at 09:44
  • 1
    Thanks, I just searched and that looks like I was wrong about binding to one IP. http://stackoverflow.com/questions/909453/single-ssl-cert-on-multiple-servers – Ray Vahey Oct 18 '11 at 12:10
  • If you're just removing individual nodes rather than stopping the service altogether, surely it would be a lot more sensible to REJECT the connection, allowing the load balancer to reassign the request immediately, rather than wait for a timeout using DROP? – symcbean Jun 12 '15 at 23:33
7

The recommended way for distributing traffic from your ELB is to have an equal number of instances across multiple availability zones. For example:

ELB

  • Instance 1 (us-east-a)
  • Instance 2 (us-east-a)
  • Instance 3 (us-east-b)
  • Instance 4 (us-east-b)

There are two ELB API actions of interest that allow you to programmatically (or via the control panel) detach instances:

  1. Deregister an instance
  2. Disable an availability zone (which subsequently disables the instances within that zone)

The ELB Developer Guide has a section that describes the effects of disabling an availability zone. A note in that section is of particular interest:

Your load balancer always distributes traffic to all the enabled Availability Zones. If all the instances in an Availability Zone are deregistered or unhealthy before that Availability Zone is disabled for the load balancer, all requests sent to that Availability Zone will fail until you call DisableAvailabilityZonesForLoadBalancer for that Availability Zone.

What's interesting about the above note is that it could imply that if you call DisableAvailabilityZonesForLoadBalancer, the ELB could instantly start sending requests only to the remaining enabled zones - possibly resulting in a zero-downtime experience while you perform maintenance on the servers in the disabled availability zone.

The above 'theory' needs detailed testing or acknowledgement from an Amazon cloud engineer.
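
For reference, both actions are available through the AWS CLI as well; a sketch with placeholder names:

# 1. Deregister a single instance from the load balancer
aws elb deregister-instances-from-load-balancer \
    --load-balancer-name my-load-balancer \
    --instances i-XXXXXXX

# 2. Disable an entire Availability Zone for the load balancer
aws elb disable-availability-zones-for-load-balancer \
    --load-balancer-name my-load-balancer \
    --availability-zones us-east-1a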

Faraz
  • 113
  • 5
4

It seems like there have already been a number of responses here, and some of them offer good advice, but I think that in general your design is flawed. No matter how perfectly you design your shutdown procedure to make sure that a client's connection is closed before shutting down a server, you're still vulnerable:

  1. The server could lose power.
  2. Hardware failure causes the server to fail.
  3. The connection could be closed by a network issue.
  4. The client loses internet or wifi.

I could go on with the list, but my point is that instead of designing the system to always work correctly, you should design it to handle failures. If you design a system that can handle a server losing power at any time, then you've created a very robust system. This isn't a problem with the ELB; this is a problem with your current system architecture.

bwight
  • 3,300
  • 17
  • 21
  • 2
    You're right, there are plenty of scenarios that could cause an instant loss of connection, but I think it's a question of degree. Auto scaling is designed to be commonplace; instances are billed on the hour, so you might scale up or down every hour... that's a lot of lost connections. – Stephen Apr 04 '13 at 07:27
2

A caveat not discussed in the existing answers is that ELBs also use DNS records with 60-second TTLs to balance load between multiple ELB nodes (each having one or more of your instances attached to it).

This means that if you have instances in two different availability zones, you probably have two IP addresses for your ELB, each with a 60s TTL on its A record. When you remove the final instances from one of those availability zones, your clients "might" still use the old IP address for at least a minute - and faulty DNS resolvers might behave much worse.

ELBs also end up with multiple IPs, and the same problem, when a single availability zone holds a very large number of instances - too many for one ELB server to handle. In that case ELB creates another server and adds its IP to the list of A records, again with a 60-second TTL.
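
You can observe this yourself by resolving the ELB's DNS name, for example with dig (the hostname is a placeholder):

# Each A record is one ELB node; the second column shows the 60-second TTL
dig +noall +answer ELBNAME.REGION.elb.amazonaws.com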

Evgeny
  • 6,533
  • 5
  • 58
  • 64
  • The stated contract, as I understand it, is that traffic routed (due to stale DNS) to an AZ with no healthy instances will be forwarded by ELB to an AZ that does have healthy instances. You can test this out by setting up 2 instances in different AZs, shut one down, then force traffic to the ELB IP for the shut down AZ and see if it still serves a healthy response. – Johnny C Jul 06 '13 at 21:38
2

I can't comment because of my low reputation, so here are some snippets I crafted that might be very useful for someone out there. They use the aws CLI tool to check when an instance has been drained of connections.

You need an EC2 instance behind an ELB, running the Python server below.

from flask import Flask
import time

app = Flask(__name__)

# Plain endpoint the ELB health check can target
@app.route("/")
def index():
    return "ok\n"

# Simulates a long-running request by sleeping for <secs> seconds
@app.route("/wait/<int:secs>")
def wait(secs):
    time.sleep(secs)
    return str(secs) + "\n"

if __name__ == "__main__":
    app.run(
        host='0.0.0.0',
        debug=True)
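
Note that with Flask's defaults this server listens on port 5000, so the ELB listener and health check would have to point at that port (or adjust app.run accordingly).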

Then run the following script from a local workstation against the ELB.

#!/bin/bash

which jq > /dev/null || {
    echo "Get jq from http://stedolan.github.com/jq"
    exit 1
}

# Fill in following vars
lbname="ELBNAME"
lburl="http://ELBURL.REGION.elb.amazonaws.com/wait/30"
instanceid="i-XXXXXXX"

getState () {
    aws elb describe-instance-health \
        --load-balancer-name "$lbname" \
        --instances "$instanceid" | jq -r '.InstanceStates[0].State'
}

register () {
    aws elb register-instances-with-load-balancer \
        --load-balancer-name "$lbname" \
        --instances "$instanceid" | jq .
}

deregister () {
    aws elb deregister-instances-from-load-balancer \
        --load-balancer-name "$lbname" \
        --instances "$instanceid" | jq .
}

waitUntil () {
    echo -n "Wait until state is $1"
    while [ "$(getState)" != "$1" ]; do
        echo -n "."
        sleep 1
    done
    echo
}

# Actual dance:
# make sure the instance is registered, fire a long request,
# then deregister and wait until the node leaves the pool

if [ "$(getState)" == "OutOfService" ]; then
    register >> /dev/null
fi

waitUntil "InService"

curl "$lburl" &
sleep 1

deregister >> /dev/null

waitUntil "OutOfService"
Loa
  • 21
  • 2
  • See http://docs.aws.amazon.com/autoscaling/latest/userguide/as-enter-exit-standby.html#standby-instance-health-status - I think this contains a better approach and should be quicker. As I understand it, the approach above will probably lead to the autoscaling group creating a new node as you de-register one to update... – David Goodwin May 31 '16 at 12:34