"search1" is an AWS elasticsearch service. It has an access policy that only lets traffic through from selected IP addresses. My understanding is that AWS implements this as an ELB in front of a VPC that I cannot access.
"esproxy" is a AWS EC2 instance to act as a proxy to search1. On esproxy, nginx
is configured to require https basic auth, and anything that authenticates gets proxied to search1.
It works, for a while: hours, or a day. Then every request starts giving "504 Gateway Time-out" errors. nginx still responds instantly with 401 auth-required errors, but authenticated requests take two minutes before a timeout comes back. Neither side seems to be under much load when this happens, and a restart of nginx fixes it. And really, traffic through the proxy is not heavy, a few thousand hits a day.
Trying to understand the problem, I tried to use openssl like telnet:
openssl s_client -connect search1:443
[many ssl headers including certs shown rapidly]
GET / HTTP/1.1
Host: search1
HTTP/1.1 408 REQUEST_TIMEOUT
Content-Length:0
Connection: Close
It takes about a minute for that 408 timeout to come back to me. Aha, I think, this particular server is having issues. But then I tried that openssl test from another host. Same delay.
Then I think to myself, hey, curl works to test https too, now that I know the ssl layer is snappy. Well, with curl access works, even while nginx and openssl are timing out from the esproxy at the same time.
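For reference, the kind of curl check I mean is roughly this (using the same search1 shorthand as above; -k only because I'm hitting the shorthand name rather than whatever is on the cert):

curl -sk -o /dev/null -w 'status=%{http_code} total=%{time_total}s\n' https://search1/

That returns promptly even while the openssl-by-hand test is stalling.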
So I think, maybe something about the headers? curl has different headers than I'm typing into openssl.
I modified a low-level http/https tool to let me easily send specific headers, and found it doesn't seem to be missing or extra headers, but the line endings. nginx (and apache) don't care whether you use DOS-style line endings (correct per the HTTP spec) or Unix-style (incorrect). The search1 instance (either elasticsearch itself or the ELB) apparently cares a lot.
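To make the line-ending comparison repeatable outside my modified tool, something along these lines (a sketch, typed from memory) shows the difference: the first request uses CRLF endings, the second uses bare LF, which is what you get typing into openssl interactively unless you pass its -crlf flag:

# CRLF line endings, per the HTTP spec
printf 'GET / HTTP/1.1\r\nHost: search1\r\nConnection: close\r\n\r\n' | openssl s_client -connect search1:443 -quiet

# bare LF line endings, like interactive typing
printf 'GET / HTTP/1.1\nHost: search1\nConnection: close\n\n' | openssl s_client -connect search1:443 -quiet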
Without knowing a whole lot about nginx, I have these questions:
- Could the source of my proxy timeouts be a bunch of existing connections caught up with bad request line endings?
- How can I tell?
- It might not be, since the timeouts are different (one vs. two minutes).
- Does nginx correct line endings on proxied requests by default?
- If not, can it be forced to?
- AND if the line endings are a red herring, how can I get nginx to help me figure this out? All I see in the log is "upstream timed out (110: Connection timed out) while reading response header from upstream", which doesn't improve my understanding of the issue. (See the logging sketch after this list.)
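For that last question, this is the sort of access-log detail I was hoping to get out of nginx (a sketch using the standard upstream variables; log_format has to live at the http level):

log_format upstream_timing '$remote_addr [$time_local] "$request" $status '
                           'req_time=$request_time up_addr=$upstream_addr '
                           'up_status=$upstream_status up_time=$upstream_response_time';
access_log /var/log/nginx/esproxy-access.log upstream_timing;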
I found this issue earlier in my debugging:
nginx close upstream connection after request
And I've already fixed the nginx conf to use an HTTP/1.1 proxy as outlined there. Relevant conf:
upstream search1 {
    server search1-abcdefghijklmnopqrstuvwxyz.us-east-1.es.amazonaws.com:443;
    # number of connections to keep alive, not time
    keepalive 64;
}
location / {
    proxy_set_header X-Forwarded-For $remote_addr;
    proxy_set_header Host "search1-abcdefghijklmnopqrstuvwxyz.us-east-1.es.amazonaws.com";
    # must suppress auth header from being forwarded
    proxy_set_header Authorization "";
    # allow keep-alives on proxied connections
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_pass https://search1;
}
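For completeness, this is roughly how I poke the proxy when it's wedged (placeholder credentials, and esproxy standing in for the instance's real name):

curl -u someuser:somepass -sk -o /dev/null -w 'status=%{http_code} total=%{time_total}s\n' https://esproxy/

Unauthenticated requests still get the instant 401 from nginx; it's only the authenticated, proxied path that hangs for two minutes before the 504.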