26

Our load balancer is returning 502 errors for some requests. Only a very small fraction is affected: we handle around 36,000 requests per hour and see about 40 errors per hour, so roughly 0.1% of requests return an error.

The instances are healthy when the errors occur, and we have added this rule to the firewall for the load balancer: 130.211.0.0/22 tcp:1-5000 Apply to all targets

It is not a very serious problem because the application tolerates such errors, but I would like to know why they occur.

Any help will be appreciated.

Jordi

3 Answers

17

It seems that there is no easy solution for this.

As Mike Fotinakis explains in this blog (thank you for this info JasonG :)):

It turns out that there is a race condition between the Google Cloud HTTP(S) Load Balancer and NGINX’s default keep-alive timeout of 65 seconds. The NGINX timeout might be reached at the same time the load balancer tries to re-use the connection for another HTTP request, which breaks the connection and results in a 502 Bad Gateway response from the load balancer.

In my case I'm using Apache with the mpm_prefork module. The proposed solution is to increase the connection keep-alive timeout to 650 seconds, but that is not practical here: with prefork each connection gets its own process, so holding idle connections open that long would waste a lot of resources.

UPDATE:
It seems that there is now some documentation about this problem on the official load balancer documentation page (search for "Timeouts and retries"): https://cloud.google.com/compute/docs/load-balancing/http/

They recommend setting the keep-alive timeout to 620 seconds in both cases (Apache and Nginx).
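As a rough sketch of the reasoning, the check below compares a backend keep-alive timeout with the 600-second idle timeout that the quoted blog post attributes to the load balancer. The function names are made up for illustration. The only real identifiers are the Apache `KeepAliveTimeout` and nginx `keepalive_timeout` directives, which the helper just renders as text.

```python
# Idle timeout of the GCP HTTP(S) load balancer, as stated in the quoted
# blog post. Treat this value as an assumption and verify it in the docs.
LB_IDLE_TIMEOUT_S = 600


def backend_may_close_first(backend_keepalive_s, lb_idle_timeout_s=LB_IDLE_TIMEOUT_S):
    """Return True if the backend can close an idle connection before the
    load balancer does, which is the situation that can produce the 502 race."""
    return backend_keepalive_s <= lb_idle_timeout_s


def keepalive_directive(server, seconds):
    """Render the keep-alive directive line for a given web server.

    Hypothetical helper: it only formats text, it does not edit any config.
    """
    if server == "apache":
        return f"KeepAliveTimeout {seconds}"
    if server == "nginx":
        return f"keepalive_timeout {seconds}s;"
    raise ValueError(f"unsupported server: {server}")


if __name__ == "__main__":
    for timeout in (65, 620):
        risky = backend_may_close_first(timeout)
        print(f"{timeout}s keep-alive: race possible = {risky}")
    print(keepalive_directive("apache", 620))
    print(keepalive_directive("nginx", 620))
```

With a 65-second timeout the backend closes idle connections first, so the race is possible. With 620 seconds the load balancer is the side that closes them.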

Jordi
12

I had an issue with 502s that I couldn't explain after recreating a load balancer and backend config. I recreated my backend and instance group for unmanaged instances, and this seemed to fix the issue for me. I wasn't able to identify any issues in my configuration in GCP :(

But I had a lot more errors, about 1 in 10 requests. The load balancer logs will tell you what the cause is, and the docs explain each cause.

E.g. mine were: jsonPayload: { statusDetails: "failed_to_pick_backend" @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry" }
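To see which cause dominates, something like the following sketch could tally the `statusDetails` values. It assumes the log entries have been exported as newline-delimited JSON, one entry per line. That export format, the function name, and the sample entries are all assumptions for illustration, not real log output.

```python
import json
from collections import Counter


def count_status_details(lines):
    """Count jsonPayload.statusDetails values across JSON log lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        entry = json.loads(line)
        # Entries without the field are grouped under "unknown".
        details = entry.get("jsonPayload", {}).get("statusDetails", "unknown")
        counts[details] += 1
    return counts


# Made-up sample entries, reduced to the one field used here.
SAMPLE = [
    '{"jsonPayload": {"statusDetails": "failed_to_pick_backend"}}',
    '{"jsonPayload": {"statusDetails": "backend_connection_closed_before_data_sent_to_client"}}',
    '{"jsonPayload": {"statusDetails": "failed_to_pick_backend"}}',
]

if __name__ == "__main__":
    for details, n in count_status_details(SAMPLE).most_common():
        print(f"{n:5d}  {details}")
```

For a real export, pass an open file object instead of `SAMPLE`, since iterating over a file yields its lines.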

If you're using nginx, the errors are on POST requests, and the error is reported as "backend_connection_closed_before_data_sent_to_client", it may be fixed by changing your nginx timeouts. See this excellent blog post:

https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340#.btzyusgi6

JasonG
  • I'm using Apache, but yes, the errors are on POST requests and the error is "backend_connection_closed_before_data_sent_to_client". I have changed the KeepAliveTimeout configuration of Apache to 65 seconds and the problem was solved. Thank you for your help JasonG! :) – Jordi Dec 28 '16 at 09:52
  • There seem to be fewer errors, but they are still happening. I'll check again in a few hours. – Jordi Dec 28 '16 at 09:59
  • I think you need the timeout to be longer than 600s. – JasonG Dec 31 '16 at 21:36
  • "To fix this race condition, set “keepalive_timeout 650;” in nginx so that your timeout is longer than the 600 second timeout in the GCP HTTP(S) Load Balancer. This causes the load balancer to be the side that closes idle connections, rather than nginx, which fixes the race condition! (This is not a 100% accurate description for how closing TCP connections works, but it’s fair enough for here)." – JasonG Dec 31 '16 at 21:36
  • In my case the server is IIS 10.0, and the Google documentation has no details about IIS. I had to raise a ticket with the Google Cloud team. The details are in the following Stack Overflow question and answer - https://stackoverflow.com/a/50201711/1751464 – SanS May 06 '18 at 23:49
  • It's 2019 and this is exactly what's happening on our App Engine Flex instance. – EFreak Jan 29 '19 at 03:44
0

Sometimes you can get unexplained 502 errors because your autoscaled instance group creates instances using the EVEN distribution shape. After I changed it to the BALANCED shape, 99% of the errors were gone. You can read about it here: https://cloud.google.com/compute/docs/instance-groups/regional-mig-distribution-shape