
We have an Express web API deployed on an Ubuntu EC2 server, proxied to through NGINX, that gets a decent but still relatively small amount of traffic (around 10 requests per second on average). Every so often a request will hang, and if the client waits long enough, a line containing the below is output to the NGINX error log:

upstream timed out (110: Connection timed out) while connecting to upstream

I've already tried the suggested solution here, but it seems to have had no effect. To our knowledge this only happens about 1-3 times every minute, but I'm only going off of these logs. If the client refreshes their page or navigates away before the request times out, there seems to be no record of it.

The error message obviously indicates that something is going wrong when connecting to the upstream server, but why would this happen so infrequently? There is also absolutely no pattern in the URLs that trigger the issue, and as far as I can tell the application being proxied to remains available. Here's an idea of our NGINX config:

user www-data;
worker_processes 4;
pid /run/nginx.pid;
events {
    worker_connections 10000;
}
worker_rlimit_nofile 25000;

http {
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
types_hash_max_size 2048;
client_max_body_size 15M;
include /etc/nginx/mime.types;
include /etc/nginx/conf.d/ *.conf; # space added before the star because the site's formatting was turning it into a comment
include /etc/nginx/sites-enabled/ *;
default_type application/octet-stream;

log_format nginx_json '{ "timestamp": "$time_local", '
                      ' "request_ip": "$remote_addr", '
                      ' "request_user": "$remote_user", '
                      ' "request_bytes_sent": "$bytes_sent", '
                      ' "response_status": "$status", '
                      ' "request": "$request", '
                      ' "request_method": "$request_method", '
                      ' "http_referrer": "$http_referer", '
                      ' "http_user_agent": "$http_user_agent", '
                      ' "request_id": "$request_id", '
                      ' "server_name": "$server_name",'
                      ' "response_time": "$upstream_response_time" }';

access_log /var/log/nginx/access.log nginx_json;
error_log /var/log/nginx/error.log;

gzip on;
gzip_disable "msie6";
ssl_prefer_server_ciphers on;
ssl_session_cache shared:SSL:10m;
ssl_ciphers "EECDH+AESGCM:EDH+AESGCM:ECDHE-RSA-AES128-GCM-SHA256:AES256+EECDH:DHE-RSA-AES128-GCM-SHA256:AES256+EDH:ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-AES256-SHA:ECDHE-RSA-AES128-SHA:DHE-RSA-AES256-SHA256:DHE-RSA-AES128-SHA256:DHE-RSA-AES256-SHA:DHE-RSA-AES128-SHA:ECDHE-RSA-DES-CBC3-SHA:EDH-RSA-DES-CBC3-SHA:AES256-GCM-SHA384:AES128-GCM-SHA256:AES256-SHA256:AES128-SHA256:AES256-SHA:AES128-SHA:DES-CBC3-SHA:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!MD5:!PSK:!RC4";
ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
ssl_dhparam /etc/ssl/certs/dhparam.pem;

resolver 127.0.0.1 valid=30s;

server {
    listen 80;
    server_name a.mysite.com;
    return 301 https://$server_name$request_uri;
}

 server {
     listen 443 ssl;
     server_name a.mysite.com;
     add_header Strict-Transport-Security "max-age=31536000";
     add_header Cache-Control no-cache;
     location /api {
         proxy_pass http://1.2.3.4:3001;
         proxy_set_header Host $host;
         proxy_set_header X-Request-Id $request_id;
         proxy_set_header Connection "";
         proxy_http_version 1.1;
     }
     location /ui2 {
         set $uiHost https://abc.cloudfront.net/ui2/index.html?v=1503438694163;
         proxy_pass $uiHost;
     }
     location / {
         set $uiHost https://abc.cloudfront.net/ui/index.html?v=1504012942606;
         proxy_pass $uiHost;
     }
     ssl_certificate /path/to/certificate;
     ssl_certificate_key /path/to/certificate/key;
 }

The server blocks at the bottom of this are repeated for multiple subdomains, with the /api path usually pointing to the same server on different ports. One subdomain gets the bulk of the traffic. The upstream server (1.2.3.4 in the example) is configured with EC2 security groups to allow access only from the NGINX server. Obviously the error message suggests there could be something wrong with the upstream server running the Express app, but nothing in our logs indicates that is the case.
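One thing worth noting about the config above: the /api location already sets proxy_http_version 1.1 and clears the Connection header, but NGINX only actually reuses upstream connections when proxy_pass points at an upstream block that contains a keepalive directive; with a bare IP, it opens and tears down a new TCP connection for every request. A hedged sketch of what that could look like ("express_api" is a made-up name; IP and port are from the question):

```nginx
# Sketch only -- connection pooling to the Express app.
upstream express_api {
    server 1.2.3.4:3001;
    keepalive 32;    # idle connections kept open to the upstream, per worker
}

server {
    # ...existing listen/ssl/server_name directives...
    location /api {
        proxy_pass http://express_api;
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # required, or keepalive connections get closed
        proxy_set_header Host $host;
        proxy_set_header X-Request-Id $request_id;
    }
}
```

Fewer new connections per second would also mean fewer sockets cycling through TIME_WAIT on the proxy host, which may or may not be related to the symptom here.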

A few final things to note:

  • I recently increased worker_connections from 768 to 10000, which seems to have made the issue occur slightly less frequently. We never get anywhere close to the connection limit, however, and connections are being closed.
  • After this increase, each time NGINX is reloaded we get none of these errors for around 10 minutes. This is the primary reason I suspect NGINX, but I'm no expert.
  • While Googling around I saw a post suggesting that the proxy_set_header Host $host; statement could be causing this. That didn't make much sense to me, but it's something to think about. I have not tested removing it.
  • The API server running the Express app always appears to be performing normally, and that server is not under heavy load from what we can tell.
  • This issue does not occur on the proxies to CloudFront.
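Since the worker_connections limit is never approached, another place intermittent connect timeouts can originate is the TCP layer on either box: ephemeral-port/TIME_WAIT churn on the proxy, or the listen backlog overflowing on the Node server so that SYNs get silently dropped. A hedged sysctl sketch for the proxy host (illustrative values, not a prescription; to be tested one change at a time):

```
# /etc/sysctl.d/99-tuning.conf -- illustrative values only
net.ipv4.ip_local_port_range = 10240 65535  # more ephemeral ports for outbound connects
net.ipv4.tcp_tw_reuse = 1                   # allow reuse of TIME_WAIT sockets for new outbound connections
net.core.somaxconn = 4096                   # larger accept backlog (the app must also request a matching backlog)
# note: avoid tcp_tw_recycle -- it is known to break clients behind NAT
```

Apply with `sudo sysctl -p /etc/sysctl.d/99-tuning.conf` and watch whether the error rate changes.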

Anyone have anything obvious or ideas about what to investigate further? Could really use some help here as we're pretty lost.

Update: I added some extra variables to the logs as recommended and was able to tie the error to an access log entry. Here are the relevant variables:

{
    "offset": 64270628,
    "response_status": "504",
    "upstream_header_time": "60.001",
    "input_type": "log",
    "source": "/var/log/nginx/access.log",
    "request_method": "GET",
    "http_user_agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "@timestamp": "2017-08-30T15:29:15.981Z",
    "upstream_connect_time": "60.001",
    "request_user": "-",
    "response_time": "60.001",
    "request_bytes_sent": "345",
    "request_id": "90a41e2224cc4b2c1d3c23d544b9146c",
    "timestamp": "30/Aug/2017:15:29:15 +0000"
}
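Since each access-log entry is a single JSON object per line, isolating the 504s to inspect their upstream timings is a simple grep. A sketch, using a two-line sample file in place of /var/log/nginx/access.log (field names from the log_format above; values are made up):

```shell
# Build a stand-in sample for /var/log/nginx/access.log.
printf '%s\n' \
  '{ "response_status": "504", "request": "GET /api/example", "response_time": "60.001" }' \
  '{ "response_status": "200", "request": "GET /api/example", "response_time": "0.140" }' \
  > access.sample

# Pull out the 504s. An upstream_connect_time pinned at ~60s means the TCP
# handshake to the upstream itself never completed, i.e. the request never
# reached the Express app.
grep '"response_status": "504"' access.sample
```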
Ryan Bickham
  • An extended access_log format with variables such as *$upstream_connect_time* and *$upstream_header_time* would give some information – Anatoly Aug 30 '17 at 09:04
  • @Anatoly I added this to our logs and was able to successfully match up the error log with an access log. The request does seem to have timed out. I added some of the variables that could be relevant above – Ryan Bickham Aug 30 '17 at 15:33
  • In the logs above it takes 60 seconds for the TCP connect, which is the key. How many connections in the TIME_WAIT state do you have when the error is returned? – Anatoly Aug 30 '17 at 15:46
  • @Anatoly About 600-800 are in timewait on average around when this error is occurring (see below for ss output). However I have seen the error happen during periods of less traffic. I did find [this guide](https://engineering.gosquared.com/optimising-nginx-node-js-and-networking-for-heavy-workloads) which has some sysctl changes I could try. `TCP: 889 (estab 223, closed 655, orphaned 3, synrecv 0, timewait 655/0), ports 0` – Ryan Bickham Aug 30 '17 at 18:18
  • Is the nodejs server deployed in clustered mode, and how many CPUs does the machine have? I want to figure out if a single instance of express is the bottleneck here. I have seen similar cases in the past. – abskmj Aug 30 '17 at 19:17
  • @abskmj We have multiple apps running in clustered mode on the same box (served on different ports). The one with the most traffic and the only one we've seen this error from is running in a four process cluster. The server has 4 CPUs (r3.xlarge). This server is actually much too big for the usage at the moment. It sits at <10% CPU usage and <20% memory usage during peak hours. – Ryan Bickham Aug 30 '17 at 19:39
  • Thanks for the information. Are you seeing similar timeouts on other nodejs applications? – abskmj Aug 30 '17 at 19:45
  • @abskmj We are not. Only with that specific instance of the app running on that port. But again, this one instance retrieves significantly more requests than any other. The node app continues to serve many requests while these errors occur so I don't think there's any event loop blocking going on or anything. We've also never replicated this through application load testing. – Ryan Bickham Aug 30 '17 at 19:50
  • How are the response timings of the upstream server? I see NGINX logs config has those logged `"response_time": "$upstream_response_time"`. I'm (almost) convinced that this is not an issue with nodejs. :thumbs_up: – abskmj Aug 30 '17 at 19:59
  • @abskmj Appreciate the sanity check on the node side for sure! They're normal besides the requests that are failing (~140ms is our average API wide). And again I don't believe these requests are even reaching the upstream server because we're getting no logs for them there. It could (maybe) be a networking related error with the server the app is running on itself perhaps but I think our app is fine. – Ryan Bickham Aug 30 '17 at 20:05
  • Just to check: what the version of Nginx, NodeJS and Linux kernel? – Anatoly Aug 30 '17 at 20:07
  • @Anatoly Nginx 1.12.0 Node 6.6.0 Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-48-generic x86_64) – Ryan Bickham Aug 30 '17 at 20:13
  • @RyanBickham were you able to get to the root cause of this behavior? I am observing something similar: requests don't seem to be reaching the upstream nodejs servers, but the nginx error log shows connection timeouts fairly uniformly distributed throughout the day. – khallas Jan 13 '18 at 04:07

1 Answer


In my case, I had the exact same issue with an NGINX reverse proxy inside a private subnet and a NAT Gateway associated with it.

The problem was that the number of connections allocated for the NGINX instance (a setting on the NAT Gateway) was not enough, and it did not autoscale.

It was not an issue with NGINX, nor with the backend API behind NGINX.

In the case of GCP, this setting can be changed in the Cloud NAT console. For the specifics, please refer to this answer.

snowpeak