
"search1" is an AWS elasticsearch service. It has an access policy that only lets traffic through from selected IP addresses. My understanding is that AWS implements this as an ELB in front of a VPC that I cannot access.

"esproxy" is a AWS EC2 instance to act as a proxy to search1. On esproxy, nginx is configured to require (https) basic auth, and anything with that gets proxied to search1.

It works for a while: hours, or a day. Then every request starts returning "504 Gateway Time-out" errors. nginx still responds instantly with 401 auth-required errors, but authenticated requests take two minutes to come back with the timeout. Neither side seems to be under much load when this happens, and a restart of nginx fixes it. And really, traffic through the proxy is not heavy, a few thousand hits a day.

Trying to understand the problem, I used openssl as a makeshift telnet:

openssl s_client -connect search1:443
[many ssl headers including certs shown rapidly]
GET / HTTP/1.1
Host: search1

HTTP/1.1 408 REQUEST_TIMEOUT
Content-Length:0
Connection: Close

It takes about a minute for that 408 timeout to come back to me. Aha, I think, this particular server is having issues. But then I tried that openssl test from another host. Same delay.

Then I think to myself: hey, curl works to test https too, now that I know the ssl layer is snappy. And with curl, access works, even while nginx and openssl are timing out from esproxy at that same time.
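
This is roughly the curl test (hostname shortened as above), run against search1 directly, same as the openssl test:

curl -v https://search1/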

So I think, maybe something about the headers? curl has different headers than I'm typing into openssl.

I modified a low-level http/https tool to let me easily send specific headers. And I found it doesn't seem to be missing or extra headers, but the line endings. nginx (and Apache) don't care whether you use DOS-style line endings (CRLF, correct per the HTTP spec) or Unix-style (LF, incorrect). The search1 instance (either Elasticsearch itself or the ELB) apparently cares a lot.
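
For anyone reproducing this: the openssl test can be made to send proper CRLF endings either with s_client's -crlf option (which translates typed line feeds into CR+LF) or by piping in a pre-built request, e.g.:

printf 'GET / HTTP/1.1\r\nHost: search1\r\nConnection: close\r\n\r\n' |
  openssl s_client -connect search1:443 -quiet

Sent with CRLF, the request matches what curl sends.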

Without knowing a whole lot about nginx, I have these questions:

  1. Could the source of my proxy timeouts be a bunch of existing connections caught up with bad request line endings?
    • How can I tell?
    • It might not be since the timeouts are different (one vs two minutes).
  2. Does nginx correct line endings on proxied requests by default?
    • If not, can it be forced to?
  3. AND if the line endings are a red herring, how can I get nginx to help me figure this out? All I see in the log is "upstream timed out (110: Connection timed out) while reading response header from upstream", which doesn't improve my understanding of the issue. (The only concrete idea I've had so far is the logging sketch just after this list.)
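
That sketch: log which upstream address each request actually hits, plus the upstream timings, something like this (untested; $upstream_connect_time needs a reasonably recent nginx, and log_format belongs in the http block):

log_format upstream_diag '$remote_addr [$time_local] "$request" $status '
                         'upstream=$upstream_addr ustatus=$upstream_status '
                         'uconnect=$upstream_connect_time uresponse=$upstream_response_time';

access_log /var/log/nginx/esproxy_upstream.log upstream_diag;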

I found this question earlier in my debugging: nginx close upstream connection after request. I've already fixed the nginx conf to use an HTTP/1.1 proxy as outlined there. Relevant conf:

upstream search1 {
  server search1-abcdefghijklmnopqrstuvwxyz.us-east-1.es.amazonaws.com:443;

  # number of connections to keep alive, not time
  keepalive 64;
}

location / {
  proxy_set_header   X-Forwarded-For $remote_addr;
  proxy_set_header   Host "search1-abcdefghijklmnopqrstuvwxyz.us-east-1.es.amazonaws.com";
  # must suppress the client's auth header from being forwarded
  proxy_set_header   Authorization "";

  # allow keep-alives on proxied connections
  proxy_http_version 1.1;
  proxy_set_header Connection "";

  proxy_pass         https://search1;
}
  • Make a note of the IP address that name resolves to. For my money, it's periodically changing, and Nginx doesn't refresh DNS lookups with this config style. Restarting Nginx = fixes the problem? Confirms the theory. The 408 is (or should be) the ELB complaining that you didn't send complete, valid headers within the default 1 minute timeout. Request Timeout != Gateway (Response) Timeout. – Michael - sqlbot Dec 06 '17 at 02:34
  • Your theory is that nginx poorly caches DNS lookups? I suppose that's possible, but it seems like a rookie network programming mistake. If it is that, how do I get nginx to look up DNS regularly? – Cupcake Protocol Dec 06 '17 at 17:34
  • DNS lookups can be expensive and time-consuming in a low-latency environment, and "back in the day" the idea of a backend machine changing its address was pretty unusual. I don't know if this behavior is still present, but I suspect so -- see https://serverfault.com/q/240476/153161 – Michael - sqlbot Dec 06 '17 at 23:05
  • That says nginx will cache DNS for the length of the TTL. Which is entirely reasonable: it is what you would expect your caching DNS server to do, too. In this case TTL is 60 seconds. And this issue can last hours, not the mere minute for a now bogus DNS entry. – Cupcake Protocol Dec 07 '17 at 18:36
  • You're right, it does say that too. But I'm not convinced that this isn't your problem, because ELBs do change addresses. Have you verified whether the address is changing or whether restarting Nginx resolves the issue? – Michael - sqlbot Dec 08 '17 at 01:22
  • I've been logging the results of dig every five minutes, but I haven't had a repeat of the problem since I started that log. – Cupcake Protocol Dec 08 '17 at 01:24
  • @Michael-sqlbot okay, it happened again last night, about two hours *after* a DNS change. That gives your theory some weight. But how do I make nginx look up the IP addresses again? – Cupcake Protocol Dec 08 '17 at 18:06
  • 1
    Two hours after... so the ELB scaled up or down (a balancer instance was rotated out and a new one came online for whatever reason) and the old instance lingered for a couple of hours after the change. See the 2nd answer on the page I originally linked to. Putting the backend name in a variable and specifying a resolver causes the resolution to be done differently. Use 169.254.169.253 as the resolver if you're in a VPC. That's a valid resolver in any VPC with DNS enabled. – Michael - sqlbot Dec 09 '17 at 01:00
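
For reference, the variable-plus-resolver approach described in that last comment would look roughly like this (an untested sketch; the hostname is the placeholder from the question, and note that using a variable in proxy_pass bypasses the upstream block and its keepalive setting):

location / {
  # VPC-provided resolver, per the comment; re-check cached answers every 30s
  resolver 169.254.169.253 valid=30s;

  set $es_backend "search1-abcdefghijklmnopqrstuvwxyz.us-east-1.es.amazonaws.com";

  proxy_set_header   X-Forwarded-For $remote_addr;
  proxy_set_header   Host $es_backend;
  proxy_set_header   Authorization "";
  proxy_http_version 1.1;
  proxy_set_header   Connection "";

  # a variable here makes nginx resolve the name at request time via the resolver above
  proxy_pass https://$es_backend;
}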
