
I am running nginx as part of a docker-compose template. In the nginx config I refer to other services by their Docker hostnames (e.g. backend, ui). That works fine until I do this trick:

docker stop backend
docker stop ui
docker start ui
docker start backend

which makes the backend and ui containers exchange IP addresses (Docker hands out private network IPs by giving the next available IP in the CIDR range to each new requester). These 4 commands imitate the rare case when both upstream containers get restarted at the same time but the nginx container does not. I also believe this should be a very common situation when running pods on Kubernetes-based clusters.
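For what it's worth, the swap is easy to confirm (assuming the containers are named after the compose services):

    # Print each container's IP; run before and after the stop/start
    # sequence to see the two addresses trade places
    docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' backend ui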

Now nginx resolves the backend host to ui's IP and ui to backend's IP. Reloading the nginx configuration does help (nginx -s reload). Also, if I do an nslookup from within the nginx container, the IPs are always resolved correctly.
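For reference, this is the kind of check I mean (the container name is a placeholder):

    # Query Docker's embedded DNS (127.0.0.11) from inside the nginx container
    docker exec <nginx-container> nslookup backend 127.0.0.11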

So this isolates the problem to a pure nginx issue around DNS caching.

The things I tried:

  1. I have the resolver set under the http {} block in nginx config:
resolver 127.0.0.11 ipv6=off valid=10s;
  2. The most common solution proposed by folks on the internet: use variables in proxy_pass (this prevents nginx from resolving and caching DNS records on start). That did not make ANY difference at all:
server {
  <...>
  set $mybackend "backend:3000";
  location /backend/ {
    proxy_pass http://$mybackend;
  }
}
  3. Tried adding the resolver line into the location block itself (see the sketch after this list).
  4. Tried setting the variable at the http {} block level, using map:
http {  
  map "" $mybackend {
    default backend:3000;
  }
  server {
   ...
  }
}
  5. Tried the openresty fork of nginx (https://hub.docker.com/r/openresty/openresty/) with resolver local=true.
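For completeness, here is roughly what attempts 3 and 5 looked like (a reconstruction; names and ports as above, not my exact config):

    location /backend/ {
        # attempt 3: resolver moved into the location block
        resolver 127.0.0.11 ipv6=off valid=10s;
        # attempt 5 (openresty fork only): resolver 127.0.0.11 local=on valid=10s;
        set $mybackend "backend:3000";
        proxy_pass http://$mybackend;
    }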

None of these solutions had any effect at all. The DNS cache is only flushed if I reload the nginx configuration inside the container OR restart the container manually.

My current workaround is a static Docker network declared in docker-compose.yml. But this has its cons too.
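For reference, the workaround looks roughly like this (subnet and addresses are made-up values):

    # docker-compose.yml sketch: pin every service to a fixed IP so a
    # stale DNS answer in nginx still points at the right container
    version: "3.8"
    services:
      nginx:
        image: nginx:1.20
        networks:
          appnet:
            ipv4_address: 172.28.0.2
      backend:
        image: backend   # placeholder image name
        networks:
          appnet:
            ipv4_address: 172.28.0.3
      ui:
        image: ui        # placeholder image name
        networks:
          appnet:
            ipv4_address: 172.28.0.4
    networks:
      appnet:
        ipam:
          config:
            - subnet: 172.28.0.0/16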

Nginx version used: 1.20.0 (latest as of now). OpenResty versions used: 1.13.6.1 and 1.19.3.1 (latest as of now).

Would appreciate any thoughts

UPDATE 2021-09-08: A few months later I am back to solving this same issue and still no luck. It really looks like a bug in nginx: I cannot make nginx re-resolve the DNS names. There seems to be no timeout on nginx's DNS cache, and none of the options listed above to introduce timeouts or trigger a DNS flush work.

UPDATE 2022-01-11: I think the problem is really in nginx. I tested my config in many ways a couple of months ago, and it looks like something else in my nginx.conf prevents the valid parameter of the resolver directive from working properly. It is either the limit_req_zone or the proxy_cache_path directive, used for request rate limiting and caching respectively. These just don't play nicely with the valid parameter for some reason, and I could not find any information about this anywhere in the nginx docs. I will get back to this later to confirm my hypothesis.
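For reference, the isolation test I have in mind is roughly this: start from a bare config with only the resolver and a variable-based proxy_pass, then re-add the suspects one at a time (zone names and sizes below are placeholders, not my real config):

    events {}
    http {
        resolver 127.0.0.11 ipv6=off valid=10s;

        # Suspects: re-enable these one at a time and watch whether
        # re-resolution still happens after the 'valid' interval
        # limit_req_zone $binary_remote_addr zone=req_zone:10m rate=10r/s;
        # proxy_cache_path /var/cache/nginx keys_zone=app_cache:10m;

        server {
            listen 80;

            location /backend/ {
                set $mybackend "backend:3000";
                proxy_pass http://$mybackend;
            }
        }
    }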

  • Did you file a bug report on nginxs github? Looks like a bug to me. – The Fool May 10 '21 at 22:14
  • @TheFool I did not file a bug yet but I probably will. What's concerning is that adding the variable to the proxy_pass seems to help others, as this is the common (yet undocumented) solution used by others. But it does not seem to work at all for me. Nor did it work a couple of years ago when I last tried it. So I am wondering if I am missing something. – Nikita Mendelbaum May 11 '21 at 17:20
  • Workaround or not, I have read the documentation yesterday. It is clearly stated that when giving the `valid` parameter like you do, nginx is supposed to ignore the TTL and recheck at that interval. So at the latest after 10 seconds nginx should start routing correctly. > By default, nginx caches answers using the TTL value of a response. An optional valid parameter allows overriding it. http://nginx.org/en/docs/http/ngx_http_core_module.html#resolver – The Fool May 11 '21 at 17:44
  • Another interesting thing to test would be to wait 10 minutes and see if it works then, since the default TTL the Docker resolver sends is 600 seconds. After that, nginx should respect the TTL value even if it didn't respect the `valid` parameter. You can also try to lower the TTL Docker's resolver is using. – The Fool May 11 '21 at 17:45
  • I will need to get back to this and test the suggested. I am currently derailed with other projects but should be able to get my hands on this soonish. And will provide an update later. – Nikita Mendelbaum Jun 01 '21 at 16:58
  • @TheFool Just now I had the chance to test it: even after 10 minutes it is the same. Nginx still fails to re-resolve the "backend" DNS name and returns 502 Bad Gateway. What helps is either "nginx -s reload" inside the container or restarting the whole nginx container. Seems like an nginx bug to me: I can't make nginx re-resolve DNS by any method other than reloading the config manually. – Nikita Mendelbaum Sep 07 '21 at 17:19

4 Answers

Answer (score: 4)

Maybe it's because nginx's DNS resolution for upstream servers only works in the commercial version, NGINX Plus?

https://www.nginx.com/products/nginx/load-balancing/#service-discovery

  • This is true; however, other developers on the internet claim that passing the DNS name to proxy_pass using a variable does the trick, as this forces nginx to resolve the DNS name on the fly. I would have thought this undocumented behavior was changed by the nginx team, but it looks like people have kept using this trick for the last couple of years, while it never worked for me. – Nikita Mendelbaum May 11 '21 at 21:53
  • I think the proxy_pass workaround only works for proxy_pass requests. If you need to use fastcgi it will not work. – Oskar Kossuth May 13 '21 at 12:26
  • I am not using the fastcgi. What I did is I basically changed `location /api/ { proxy_pass http://backend:3000/; }` into `location /api/ { <...>; set $var_backend "http://backend:3000"; proxy_pass http://$var_backend/; }` (skipped the `rewrite` part here that handles uris correctly) – Nikita Mendelbaum May 13 '21 at 14:32
  • The Stack Overflow bounty system made me award the answer by the end of the bounty period; however, this does not answer the question. The problem still persists, even with nginx 1.21.1. The `valid` setting does not do anything in the resolver directive. – Nikita Mendelbaum Sep 08 '21 at 08:47
  • It won't answer the question, but it's still good advice... Just switch from nginx to a truly open-source, alive project (like Traefik). Nginx is known for weirdly bad documentation and lack of support for basic things unless you hack around them. Simple things should be simple, and they promote NGINX Plus, which is paid. – Piotr Sep 13 '22 at 22:54
Answer (score: 3)

TL;DR: Your Internet provider may be caching DNS records with no respect for tiny TTL values (like 1 second).

I've been trying to retest the same thing locally.

  • Your Docker might be using the local resolver (127.0.0.11)
  • Then DNS might be cached by your OS (which you can clear; how is OS-specific)
  • Then you might have it cached on your WiFi router (yes!)
  • Later it goes to your ISP and is beyond your control.

But nslookup is your friend: you can query each DNS server between nginx and the root DNS server.

Something very easy to reproduce (without setting up a local DNS server): create a Route 53 'A' record with a TTL of 1 second and try to query the AWS DNS server in your hosted zone (it will be something like ns-239.awsdns-29.com). Play around with the dig / nslookup commands:

nslookup
set type=a
server ns-239.awsdns-29.com
your.domain.com

It will return the IP you have set.

Change the Route 53 'A' entry to some other IP.

Use dig / nslookup and make sure you see the change immediately.
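For example (authoritative name server as above):

    # Query the authoritative AWS name server directly, bypassing local caches
    dig +noall +answer your.domain.com @ns-239.awsdns-29.com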

Then set the resolver in nginx to the AWS DNS server (for testing purposes only). If that works, it means DNS is cached elsewhere and this is no longer an nginx issue!

In my case it was a Sunrise WiFi router, which began to see the new IP only after I restarted it (I assume things would have resolved after some longer interval).

A great help when debugging this is an nginx compiled with

--with-debug

Then in the nginx logs you can see whether a given DNS name was resolved and to what IP.
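Assuming such a build, you also need to raise the log level, e.g.:

    # resolver traces are emitted at the 'debug' level (requires --with-debug)
    error_log /var/log/nginx/error.log debug;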

My whole config looks like this (here with the standard Docker resolver, which has to be set if you are using variables in proxy_pass!):

    server {
        listen 0.0.0.0:8888;
        server_name nginx.my.custom.domain.in.aws;
        resolver 127.0.0.11 valid=1s;

        location / {
            proxy_ssl_server_name on;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-Proto https;
            proxy_set_header Host $host;
            set $backend_servers my.custom.domain.in.aws;
            proxy_pass https://$backend_servers$request_uri;
        }
    }

Then you can try to test it with

 curl -L http://nginx.my.custom.domain.in.aws:8888/ --resolve nginx.my.custom.domain.in.aws:8888:127.0.0.1
– Piotr
  • Thank you for your detailed answer. However, I think the problem is really in nginx. I tested it in many ways and it looks like something else in my nginx.conf prevents the `valid` parameter of the `resolver` directive from working properly. It is either the `limit_req_zone` or the `proxy_cache_path` directive, used for request rate limiting and caching respectively. These just don't play nicely with the `valid` param for some reason. And I could not find any information about this anywhere in the nginx docs. – Nikita Mendelbaum Jan 11 '22 at 16:44
  • Also, I am not sure how it is related to the Internet provider if it is all local, isolated within the nginx container. Docker's local DNS works correctly; it is just nginx itself that is failing to update its cache with the info from the Docker DNS. – Nikita Mendelbaum Jan 11 '22 at 17:21
  • When testing this locally (via minikube, so having the cluster locally) my DNS chain looked like: DNS inside Docker -> DNS on my router -> ISP -> unknown-path -> Amazon. After fixing nginx I still had the issue when testing on minikube, and my (second) problem was the WiFi router; and yes, it was absolutely not what I was expecting. In your case I would really compile nginx with the `--with-debug` flag. Logs are very verbose then. – Piotr Jan 12 '22 at 16:30
Answer (score: 1)

I was struggling with exactly the same thing (on Docker Swarm), and to actually make it work I had to keep the upstream out of my configuration.

Something that works well (tested 5 minutes ago on nginx 1.22):

location ~* /api/parameters/(.*)$ {
    resolver 127.0.0.11 ipv6=off valid=1s;
    set $bck_parameters parameters:8000;
    proxy_pass http://$bck_parameters/api/$1$is_args$args;
}

where $bck_parameters is NOT an upstream but the real server name behind it. Doing the same thing with an upstream block will fail.
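For contrast, a sketch of the failing variant (names reused from above): with an upstream block the name is resolved once when the configuration is loaded, so the resolver's valid parameter never kicks in.

    # This variant keeps serving a stale IP after the container restarts:
    upstream bck_parameters_up {
        server parameters:8000;   # resolved once, at config load time
    }
    server {
        location ~* /api/parameters/(.*)$ {
            proxy_pass http://bck_parameters_up/api/$1$is_args$args;
        }
    }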

– Arnaud F.
Answer (score: 0)

After a long search I found a solution for uwsgi_pass. The same should work for proxy_pass.

    resolver 127.0.0.11 valid=10s;
    set $upstream_endpoint ${UWSGI_ADDR};
    location / {
        uwsgi_pass $upstream_endpoint;
        include uwsgi_params;
    }

where UWSGI_ADDR is the name of your application container with port, e.g. app:8000.
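One way to inject UWSGI_ADDR (assuming the official nginx Docker image, whose entrypoint runs envsubst on /etc/nginx/templates/*.template at startup):

    # The template file holds the snippet above with ${UWSGI_ADDR} in it;
    # the entrypoint writes the substituted result to /etc/nginx/conf.d/
    docker run -d \
      -e UWSGI_ADDR=app:8000 \
      -v "$PWD/default.conf.template:/etc/nginx/templates/default.conf.template:ro" \
      nginx:1.21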

UPD:

In fact, it follows from the proxy_pass documentation:

Parameter value can contain variables. In this case, if an address is specified as a domain name, the name is searched among the described server groups, and, if not found, is determined using a resolver.

Also, you can find some useful information in the section "Setting the Domain Name in a Variable" of a blog post authored by one of the nginx developers.