
I have a Node.js app on two VM instances that I'm trying to load balance with network load balancing. To verify that my servers are up and serving, I serve a health-check file, '/health.txt', on the app's internal listening port. The two instances are configured identically, with the same tags, firewall rules, etc., but the health check continuously fails for one instance. If I request the file with curl, either from my internal network or from outside, both instances respond fine, yet the network load balancer always reports one instance as down.

Running ngrep on the healthy instance, I see:

T 169.254.169.254:65374 -> my.pub.ip.addr:3000 [S]
#
T my.pub.ip.addr:3000 -> 169.254.169.254:65374 [AS]
#
T 169.254.169.254:65374 -> my.pub.ip.addr:3000 [A]
#
T 169.254.169.254:65374 -> my.pub.ip.addr:3000 [AP]
GET /health.txt HTTP/1.1.
Host: my.pub.ip.addr:3000.
.

#
T my.pub.ip.addr:3000 -> 169.254.169.254:65374 [A]
#
T my.pub.ip.addr:3000 -> 169.254.169.254:65374 [AP]
HTTP/1.1 200 OK.
X-Powered-By: NitroPCR.
Accept-Ranges: bytes.
Date: Fri, 14 Nov 2014 20:00:40 GMT.
Cache-Control: public, max-age=86400.
Last-Modified: Thu, 24 Jul 2014 17:58:46 GMT.
ETag: W/"2198506076".
Content-Type: text/plain; charset=UTF-8.
Content-Length: 13.
Connection: keep-alive.
.

#
T 169.254.169.254:65374 -> my.pub.ip.addr:3000 [AR]

But on the instance GCE claims is unhealthy, I see this:

T 169.254.169.254:61179 -> my.pub.ip.addr:3000 [S]
#
T 169.254.169.254:61179 -> my.pub.ip.addr:3000 [S]
#
T 169.254.169.254:61180 -> my.pub.ip.addr:3000 [S]
#
T 169.254.169.254:61180 -> my.pub.ip.addr:3000 [S]
#
T 169.254.169.254:61180 -> my.pub.ip.addr:3000 [S]

But if I curl the same file from my healthy instance to the 'unhealthy' instance, the 'unhealthy' instance responds fine.

regretoverflow

1 Answer


I got this working again after making contact with the Google Compute Engine team. There is a service process on a GCE VM, named google-address-manager, that needs to start at boot and keep running for as long as the VM is alive; it should run at runlevels 0-6. For some reason this service had stopped on one of my VMs and would not start when the VM booted/rebooted. Starting the service manually fixed the problem. Here are the steps we went through to determine what was wrong (this is a Debian VM):

sudo ip route list table all

This will display your route table. In the table, there should be a route to your Load Balancer Public IP:

local lb.pub.ip.addr dev eth0  table local  proto 66  scope host

If there is not, check that google-address-manager is running:

sudo service google-address-manager status

If it is not running, start it:

sudo service google-address-manager start

If it starts ok, check your route table, and you should now have a route to your load balancer IP. You can also manually add this route:

sudo /sbin/ip route add to local lb.pub.ip.addr/32 dev eth0 proto 66
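The route check above is easy to script. Here is a minimal sketch of the test logic, run against a captured copy of the route listing rather than a live VM; `203.0.113.10` is a placeholder for your load balancer's public IP, and the sample line mirrors the `proto 66` route format shown earlier:

```shell
# Sketch: given output captured from `ip route list table all`, test for
# the load-balancer route that google-address-manager installs (proto 66).
# ROUTES and LB_IP below are illustrative placeholders.
ROUTES='local 203.0.113.10 dev eth0  table local  proto 66  scope host'
LB_IP='203.0.113.10'

if printf '%s\n' "$ROUTES" | grep -q "local $LB_IP.*proto 66"; then
  echo "LB route present"                      # prints "LB route present"
else
  echo "LB route missing: start google-address-manager"
fi
```

On a live instance you would pipe `ip route list table all` into the same grep, and fall back to starting the service (and, if needed, adding the route manually) when the match fails.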

We still have not resolved why the address manager stopped and does not start on boot, but at least the LB pool is healthy.

    I worked with regretoverflow on this issue, and we found applying the fix from https://github.com/GoogleCloudPlatform/compute-image-packages/pull/121 allowed google-address-manager to start correctly. – David Nov 24 '14 at 18:39