28

I have microservices (written in different programming languages) running on an EC2 instance. In production I see occasional 502 Bad Gateway errors when these services call each other. The logs of the requested service show no sign that the API call ever arrived.

For example, service A calls service B, but nothing in service B's logs indicates that a call came from service A.

Could this be an AWS load balancer issue? Any help would be appreciated. Thanks in advance.

Solution tried: We tried creating HTTP/HTTPS connection agents in each service, but we still get this issue.

Update: In the LB logs, the API call is logged, but the target response code shows "-" whereas the LB response code shows 502 or 504. Does this mean that the LB cannot handle the traffic, or that my application cannot?

Also, what could be a possible solution?
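To see how often this pattern occurs, the LB access logs can be filtered for entries where the LB returned 502 or 504 but no target status was recorded. The following is a minimal sketch, assuming an Application Load Balancer access log with the documented space-separated field order (LB status code in the 9th field, target status code in the 10th, quoted request in the 13th). The function name and the sample log lines are made up for illustration; a Classic Load Balancer uses a different layout, so the indexes would need adjusting.

```python
import shlex

# Field positions (0-based), assuming the ALB access log layout.
TIME = 1
TARGET = 4
ELB_STATUS = 8
TARGET_STATUS = 9
REQUEST = 12


def unanswered_gateway_errors(lines):
    """Return (time, target, lb_status, request) for entries where the LB
    answered 502/504 and the target status code is "-"."""
    hits = []
    for line in lines:
        fields = shlex.split(line)  # keeps the quoted request as one field
        if len(fields) <= REQUEST:
            continue  # not a complete entry, skip it
        if fields[ELB_STATUS] in ("502", "504") and fields[TARGET_STATUS] == "-":
            hits.append((fields[TIME], fields[TARGET],
                         fields[ELB_STATUS], fields[REQUEST]))
    return hits


# Hypothetical sample entries, not taken from a real log.
SAMPLE = [
    'http 2018-05-07T17:22:00.000000Z app/example-lb/0123456789abcdef '
    '10.0.0.5:40000 10.0.1.10:80 0.000 -1 -1 502 - 120 293 '
    '"GET http://example.internal:80/api/items HTTP/1.1" "example-agent/1.0" - -',
    'http 2018-05-07T17:22:01.000000Z app/example-lb/0123456789abcdef '
    '10.0.0.5:40001 10.0.1.10:80 0.000 0.004 0.000 200 200 120 512 '
    '"GET http://example.internal:80/api/items HTTP/1.1" "example-agent/1.0" - -',
]

for hit in unanswered_gateway_errors(SAMPLE):
    print(hit)
```

A "-" target status alongside a 502 or 504 means the LB recorded no response from the target for that request, which matches the observation that service B's logs show nothing.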

rajat12a

2 Answers

28

We had the same problem.

In our setup, an AWS Application ELB has a target group of 4 EC2 instances. On each of the EC2 instances, there is an Apache2 which forwards to a Tomcat.

The ELB has a default idle timeout of 60 seconds. Apache2 has a default KeepAliveTimeout of 5 seconds. Once a connection has been idle for 5 seconds, Apache2 closes it. However, if a request comes in at precisely the wrong moment, the ELB accepts it, decides which host to forward it to, and tries to reuse the connection just as Apache closes it. The result is the 502 error code.

The solution: when you have cascading proxies/LBs, either align their KeepAlive timeouts or, preferably, make them a little longer the further down the line you get.

We set the ELB timeout to 60 seconds and the Apache2 timeout to 120 seconds. Problem gone.
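The rule can be expressed as a small check over the chain of hops. This is only an illustrative sketch: the function name and the way the hops are listed are assumptions, and the timeout values have to be read from your own ELB and web server configuration.

```python
from typing import List, Tuple


def find_keepalive_violations(hops: List[Tuple[str, float]]) -> List[str]:
    """Check a proxy chain ordered from the outermost hop to the innermost.

    Each entry is (name, idle/keep-alive timeout in seconds). A hop is
    flagged when its timeout is shorter than that of the hop in front of
    it, because it may then close an idle connection that the upstream
    hop still considers usable. Equal values are accepted here since
    aligning the timeouts is one of the two options; a longer downstream
    timeout is the preferred one.
    """
    problems = []
    for (up_name, up_timeout), (down_name, down_timeout) in zip(hops, hops[1:]):
        if down_timeout < up_timeout:
            problems.append(
                f"{down_name} ({down_timeout}s) may close idle connections "
                f"before {up_name} ({up_timeout}s) does"
            )
    return problems


# The values from our setup, before and after the change.
before = [("ELB", 60), ("Apache2", 5)]
after = [("ELB", 60), ("Apache2", 120)]

print(find_keepalive_violations(before))
print(find_keepalive_violations(after))
```

With the original values the Apache2 hop is flagged; with the ELB at 60 seconds and Apache2 at 120 seconds the list is empty.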

Jan Dörrenhaus
  • We figured out the issue in our system. It was due to the immediate shutdown of EC2 instances instead of waiting for the draining period. We already had the ELB set to 60 seconds and Apache at 120 seconds. – rajat12a May 07 '18 at 17:22
  • We are currently having the same issue. When this happens, can we see anything in the logs on the Apache side? – Naga Nov 26 '18 at 02:43
  • @Naga We didn't, no. Because the Apache does not notice anything being wrong. The ELB access logs show the request with the 502 status code, and the Apache access logs show nothing. – Jan Dörrenhaus Nov 26 '18 at 12:56
  • @Jan Thank you for the information! Actually, it's the same for us. I checked the Apache access log and error log, but I could not find anything... We will try the same settings as you and see how it goes. – Naga Nov 26 '18 at 13:48
  • This was so difficult to figure out - thanks for this Q/A. This resolved my problem as soon as I increased the KeepAliveTimeout – aknosis Dec 29 '18 at 00:11
  • Apparently if you do a packet capture on the "receiving end" (the application server) you may be able to "see the response `FIN,ACK` packets and the new request `SYN` packets cross paths in most cases, but it is hard to catch." – rogerdpack Mar 05 '19 at 18:48
  • I had the same problem, and changing the keepalive timeout was the solution. However, it is better to set the keepalive only slightly longer than the ALB idle timeout rather than twice as long, so that the target web server does not hold connections open for too long if the connection between the ALB and the target is dropped because of a network interruption. – Son Lam Oct 21 '20 at 03:20
1

In my case the health checks used HTTP/2. I got my EC2 instances running NGINX to report as healthy by adding http2 to the listen directive for port 80:

listen 80 default_server http2;