
I'm playing around with Docker Swarm.
I have a three-node cluster: one manager and two worker nodes. I'm using VIP for all my services.

I ran into a weird situation after I restarted one of the worker nodes.
I executed docker node ls and the worker node was Ready.
docker service ls showed that the replica counts for the containers on that worker were fine.
The problem: I couldn't reach the node through the ingress network. No container on another node was able to access a container on that worker node.

I checked the containers; they were all attached to the ingress network.
I curled the containers from within the same node and they responded.
I pinged the service name from a container on the same malfunctioning node and it worked.
Curling the worker's containers from the manager did not work!
I curled using the IP address of the worker and the containers responded.

I restarted the worker node, but the issue persisted. Then I restarted the whole cluster and it worked again!

Is there any explanation for what I just witnessed?
I'm most worried that this could happen in a production environment.

Thank you in advance.

Mehdi

1 Answer


This happens when overlay networking ports are not opened between the nodes (both workers and managers). From Docker's documentation, the following ports need to be opened:

  • TCP port 2377 for cluster management communications
  • TCP and UDP port 7946 for communication among nodes
  • UDP port 4789 for overlay network traffic

These ports may be blocked by iptables on either end, by a network router or firewall in the middle, or even by tools like VMware NSX. To verify that connectivity works end to end, you can run tcpdump on the selected ports at each node and ensure that requests leaving one node arrive at the other.
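As a rough sketch of that tcpdump check, the snippet below builds a capture filter for the ports listed above and prints the command to run on each node. The function name swarm_filter and the interface eth0 are placeholders, not anything Docker provides, and the data port is assumed to be the default 4789 unless you pass another value.

```shell
#!/usr/bin/env bash
# Build a tcpdump capture filter covering the swarm ports listed above.
# The optional argument is the overlay data port; 4789 is assumed by default.
swarm_filter() {
  local data_port="${1:-4789}"
  # "port 7946" without a protocol matches both TCP and UDP.
  printf 'tcp port 2377 or port 7946 or udp port %s' "$data_port"
}

# Placeholder interface name: replace with the interface that carries
# traffic between your swarm nodes.
IFACE="eth0"

# Print the command instead of running it, since tcpdump needs root.
echo "sudo tcpdump -ni $IFACE '$(swarm_filter)'"
```

Running the printed command on both the sending and the receiving node while you curl the service from the manager should show whether packets leave one host and reach the other.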

Relevant iptables rules for every node in the cluster are:

iptables -I INPUT -p tcp -m tcp --dport 2376 -j ACCEPT
iptables -I INPUT -p tcp -m tcp --dport 2377 -j ACCEPT
iptables -I INPUT -p tcp -m tcp --dport 7946 -j ACCEPT
iptables -I INPUT -p udp -m udp --dport 7946 -j ACCEPT
iptables -I INPUT -p udp -m udp --dport 4789 -j ACCEPT
iptables -I INPUT -p 50 -j ACCEPT # allows ipsec when secure overlay is enabled

If you are unable to adjust the firewall settings, swarm mode may be configured with an overlay networking port other than 4789 by using docker swarm init --data-path-port.
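A minimal sketch of that option follows. The port 7777 and the helper valid_data_path_port are illustrative assumptions; the 1024-49151 range is the one Docker's swarm init documentation gives for this flag. The data path port can only be chosen when the swarm is initialized, so an existing swarm would have to be recreated. The script prints the commands rather than executing them.

```shell
#!/usr/bin/env bash
# Hypothetical replacement for the default overlay data port (4789).
DATA_PATH_PORT=7777

# Check that the port is within the range documented for --data-path-port.
valid_data_path_port() {
  [ "$1" -ge 1024 ] && [ "$1" -le 49151 ]
}

if valid_data_path_port "$DATA_PATH_PORT"; then
  # Run on the first manager when creating the swarm.
  echo "docker swarm init --data-path-port $DATA_PATH_PORT"
  # Matching firewall rule for every node, replacing the 4789 rule.
  echo "iptables -I INPUT -p udp -m udp --dport $DATA_PATH_PORT -j ACCEPT"
else
  echo "port $DATA_PATH_PORT is outside 1024-49151" >&2
fi
```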

BMitch
  • Thank you for the response. The servers could communicate, and the manager would deploy new services onto that worker node; the issue is routing traffic to that node. I'm not sure it's a firewall thing, as I restarted the manager and it worked. Unfortunately I don't have this setup anymore. I only asked to see if this has already happened to someone or if anyone can share their experience. Thank you again – Mehdi Apr 07 '20 at 18:36
  • Overlay networking ports would not affect the ability to schedule workloads on nodes. – BMitch Apr 07 '20 at 18:38
  • Here are others having similar issues: https://stackoverflow.com/a/60442952/596285, https://stackoverflow.com/a/60497618/596285 – BMitch Apr 07 '20 at 18:43
  • All Docker traffic between swarm nodes (including traffic on the ingress network) goes via the Docker ports listed above. So you need to check that traffic can pass through these ports in both directions – Ryabchenko Alexander Apr 09 '20 at 18:16
  • Having a similar issue, but I don't think it is related to ports/firewall. I have 1 manager and 1 worker node, with a firewall. I tried the suggested iptables rules above, but that did not work. The only thing that worked was restarting both nodes and then recreating the swarm. Then the ingress network finally started showing up on the worker. This occurred a couple of days ago and I thought it was a fluke, but now it has just occurred again. Prior to this, things had been running fine for months. – Andrew Schlei May 08 '20 at 21:26
  • It happened again. I checked everything you said and everything is in place, but the ingress network doesn't work! – Mehdi Jul 14 '20 at 20:38
  • Run a tcpdump on each node involved, filtering on the overlay required ports listed above, and verify packets are leaving one node and making it to the target. – BMitch Jul 14 '20 at 22:04
  • The packets are neither leaving nor being received by the target nodes. I haven't changed anything in the setup. – Mehdi Jul 16 '20 at 12:39
  • @Mehdi if it helps, I haven't changed anything in your setup either. :) – BMitch Jul 16 '20 at 13:37
  • lol, no offense, I think you got me wrong. Thank you @BMitch for your help, I'm just wondering if you would have any insights – Mehdi Jul 16 '20 at 22:31
  • @Mehdi the only way I've been able to track these issues down in the past has been debugging the networking between the nodes. tcpdump and wireshark tend to be the fastest way to narrow down issues. If you want to debug networking inside of a container, there's netshoot: https://github.com/nicolaka/netshoot – BMitch Jul 17 '20 at 13:30