
I am seeing the same problem as described here and here. I have tried everything that worked in those two cases, to no avail; I still see the same behavior. Can someone offer alternatives I might try?

My setup:

I am running three CentOS 7.2 boxes, with the Network Time Protocol daemon (ntpd) running on all of them, and all have been yum updated. Here is some detailed info:

Linux version 3.10.0-327.28.2.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) )

Docker version:

# docker version
Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        
 OS/Arch:      linux/amd64
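
Since swarm's TLS certificates are sensitive to clock skew (the reason ntpd is running everywhere), a quick sanity check that the clocks actually are in sync on each box, assuming the ntp package's ntpstat is available:

# ntpstat                                  # prints "synchronised to NTP server ..." when healthy
# timedatectl | grep 'NTP synchronized'    # CentOS 7's systemd view of the same state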

Set up the swarm manager:

>docker swarm init --advertise-addr 10.1.1.40:2377
// on some retry attempts (after 'docker swarm leave --force') I ran:
>docker swarm init --advertise-addr 10.1.1.40:2377 --force-new-cluster

Manager status:

>docker node inspect self
[
{
    "ID": "3x5q1n9v956g3ptdle2eve856",
    "Version": {
        "Index": 10
    },
    "CreatedAt": "2016-08-27T13:01:13.400345797Z",
    "UpdatedAt": "2016-08-27T13:01:13.580143388Z",
    "Spec": {
        "Role": "manager",
        "Availability": "active"
    },
    "Description": {
        "Hostname": "mymanagerhost.mycompany.com",
        "Platform": {
            "Architecture": "x86_64",
            "OS": "linux"
        },
        "Resources": {
            "NanoCPUs": 4000000000,
            "MemoryBytes": 16659128320
        },
        "Engine": {
            "EngineVersion": "1.12.1",
            "Plugins": [
                {
                    "Type": "Network",
                    "Name": "bridge"
                },
                {
                    "Type": "Network",
                    "Name": "host"
                },
                {
                    "Type": "Network",
                    "Name": "null"
                },
                {
                    "Type": "Network",
                    "Name": "overlay"
                },
                {
                    "Type": "Volume",
                    "Name": "local"
                }
            ]
        }
    },
    "Status": {
        "State": "ready"
    },
    "ManagerStatus": {
        "Leader": true,
        "Reachability": "reachable",
        "Addr": "10.1.1.40:2377"
    }
}
]

On the worker nodes (I have two, but they both behave the same):

Join Swarm:

>docker swarm join --token SWMTKN-1-4fjh7kncdpwjvxnxisamhldgenmmnqyvhnx9qdi8d4hkkfuacv-168gs9okd5ck0r4lokdgpef92 10.1.1.40:2377

Error response from daemon: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use "docker info" command to see the current swarm status of your node.

Output of the docker info command:

>docker info
Plugins:
 Volume: local
 Network: null host bridge overlay
Swarm: pending
 NodeID: 
 Error: rpc error: code = 1 desc = context canceled
 Is Manager: false
 Node Address: 10.1.1.50
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.28.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.52 GiB
Name: myWorkerNode.mycompany.com
ID: DAWE:VDRA:ZUVS:P7PH:ADCP:MFNU:2LOS:C6TG:XSIS:Y7EX:I46S:KFXT
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 127.0.0.0/8
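
Those two bridge-nf-call warnings mean bridged container traffic is bypassing iptables, which can interfere with overlay networking. They can be cleared via sysctl; a sketch, where the file name under /etc/sysctl.d/ is my own choice:

# echo 'net.bridge.bridge-nf-call-iptables = 1'  >> /etc/sysctl.d/99-bridge-nf.conf
# echo 'net.bridge.bridge-nf-call-ip6tables = 1' >> /etc/sysctl.d/99-bridge-nf.conf
# sysctl --system     # reload settings from all sysctl configuration files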

Edit, per the first answer below:

So I tried leaving the swarm with a docker service stop/start around the commands. On the manager I did:

# docker swarm leave --force
Node left the swarm.
# service docker stop
Redirecting to /bin/systemctl stop  docker.service
# 
# service docker start
Redirecting to /bin/systemctl start  docker.service

# docker swarm init --advertise-addr 10.1.1.40:2377
Swarm initialized: current node (0e0y2k2hngnwyeg86ilzbrjmu) is now a manager.

To add a worker to this swarm, run the following command:
docker swarm join \
    --token SWMTKN-1-2ggj60tnbppgjlg63a58oe5pqtv0vfrpj81hheawanf76x7cjc-7v48qak22wd03y3jyv903a9if \
    10.1.1.40:2377

Then on the worker I did:

# docker swarm leave
Node left the swarm.
# service docker stop
Redirecting to /bin/systemctl stop  docker.service
# service docker start
Redirecting to /bin/systemctl start  docker.service
# docker swarm join \
>     --token SWMTKN-1-2ggj60tnbppgjlg63a58oe5pqtv0vfrpj81hheawanf76x7cjc-7v48qak22wd03y3jyv903a9if \
>     10.1.1.40:2377
Error response from daemon: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use "docker info" command to see the current swarm status of your node.

Which is obviously the same behavior...

UPDATE

I have tried all the steps outlined by @Miad Abrin and I still get the same behavior. I am guessing the cause is related to the certificate errors I see when I run:

# journalctl -xe
Aug 29 12:26:15 dockerd[6577]: time="2016-08-29T12:26:15.554904435-04:00" level=warning msg="failed to retrieve remote root CA certificate: rpc
Aug 29 12:26:15 dockerd[6577]: time="2016-08-29T12:26:15.555400400-04:00" level=warning msg="failed to retrieve remote root CA certificate: rpc
Aug 29 12:26:15 dockerd[6577]: time="2016-08-29T12:26:15.555478782-04:00" level=warning msg="failed to retrieve remote root CA certificate: rpc
Aug 29 12:26:15 dockerd[6577]: time="2016-08-29T12:26:15.555528929-04:00" level=warning msg="failed to retrieve remote root CA certificate: rpc
Aug 29 12:26:15 dockerd[6577]: time="2016-08-29T12:26:15.555685464-04:00" level=warning msg="failed to retrieve remote root CA certificate: rpc
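
The root CA fetch above happens over the manager's 2377 port, so a first check is whether the worker can complete any TLS handshake against it. A diagnostic sketch only (s_client cannot authenticate as a swarm node, but it shows whether the port answers TLS at all):

# openssl s_client -connect 10.1.1.40:2377 </dev/null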

Does anyone know the cause of this and how to correct it?

  • Can the worker ping the manager's ip? – BMitch Aug 28 '16 at 14:13
  • Try simply "docker swarm init --advertise-addr 10.1.1.40" without the trailing port number. Most importantly can the worker nodes see that IP address? No firewalls or other stuff potentially blocking the path – Mark O'Connor Aug 28 '16 at 14:14
  • I believe your token is incorrect. Check for the space after `uacv-` – Bernard Aug 28 '16 at 14:16
  • @MarkO'Connor No firewalls/proxies or any such stuff. The 'nmap' program from the worker 'sees' the open port on the manager host – JoeG Aug 28 '16 at 22:29
  • @Alkaline that is just SO cut and paste. I tried to correct, but the actual commands I execute are directly copied and pasted from one window to the other. – JoeG Aug 29 '16 at 00:38

1 Answer


You need to restart your docker daemon service before leaving the swarm and again after it. Do this on both the swarm leader and the workers. This is a bug in version 1.12 that is fixed in 1.12.1; I had the same problems myself.

My Results when trying this

In the two sections below I numbered the steps with (num) to show the order between the worker and the manager:

On the worker:

(1)# docker swarm leave --force
Error response from daemon: This node is not part of a swarm
(2)# service docker stop
Redirecting to /bin/systemctl stop  docker.service
(6)# service docker start
Redirecting to /bin/systemctl start  docker.service
# 
(8)# docker swarm join \
>     --token SWMTKN-1-4gsdy8jshxmd58mvpcm0tlmbbnrrqdrf51ggcwvdv0bvkltxmy-am9o4dsl4ovx6b4lbsabn0fc7 \
>     10.1.1.40:2377
Error response from daemon: Timeout was reached before node was joined. The attempt to join the swarm will continue in the background. Use the "docker info" command to see the current swarm status of your node.
(9)# nmap -p2377 10.1.1.40

Starting Nmap 6.40 ( http://nmap.org ) at 2016-08-29 10:32 EDT
Nmap scan report for (10.1.0.123)
Host is up (0.00085s latency).
PORT     STATE    SERVICE
2377/tcp filtered unknown
MAC Address: 00:50:56:B9:76:32
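
Note the port state above: 2377/tcp is filtered, not open, which points at a firewall dropping packets between the worker and the manager. If firewalld is active on the manager (the CentOS 7 default), opening the swarm ports would look roughly like this (a sketch; adjust if you manage iptables directly):

# firewall-cmd --permanent --add-port=2377/tcp   # cluster management (swarm init/join)
# firewall-cmd --permanent --add-port=7946/tcp   # container network discovery
# firewall-cmd --permanent --add-port=7946/udp
# firewall-cmd --permanent --add-port=4789/udp   # overlay network (VXLAN) data path
# firewall-cmd --reload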

On the manager node:

(3)# docker swarm leave --force
Error response from daemon: This node is not part of a swarm
(4)# service docker stop
Redirecting to /bin/systemctl stop  docker.service
(5)# service docker start
Redirecting to /bin/systemctl start  docker.service
(7)# docker swarm init --advertise-addr 10.1.1.40 --force-new-cluster
Swarm initialized: current node (7z52d3bcoiou61ltgike42dnn) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join \
    --token SWMTKN-1-4gsdy8jshxmd58mvpcm0tlmbbnrrqdrf51ggcwvdv0bvkltxmy-am9o4dsl4ovx6b4lbsabn0fc7 \
    10.1.1.40:2377
  • Edited my question with the results of my attempt to follow your suggestion - did I do that wrong? I also did a 'yum upgrade' of docker on both machines to get to 1.12.1. Still the same behavior... – JoeG Aug 28 '16 at 23:08
  • Please also remove the worker node from the master after leaving the swarm, with `docker node rm`. After doing that, restart the docker service again. – Miad Abrin Aug 29 '16 at 04:08
  • As expected, once the master is removed, all the 'node' commands fail. Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again. – JoeG Aug 29 '16 at 11:11
  • That's actually good. Just do a `docker swarm init --force-new-cluster` on your master, then start adding nodes to it again. Also do a force leave on the nodes with `docker swarm leave --force`. – Miad Abrin Aug 29 '16 at 13:38
  • I tried (again) your solution. I wasn't sure where the best place to include the results was, so I edited your answer and put them there. I still get the same results, so if you could look at that and tell me whether you see anything amiss, that would be great. – JoeG Aug 29 '16 at 14:47
  • Do this step by step: 1. go to the worker, 2. `docker swarm leave --force`, 3. `service docker restart`, 4. go to the master, 5. `docker node rm worker`, 6. `service docker restart`, 7. `docker swarm init --force-new-cluster`, 8. go to the worker, 9. run `docker swarm join` with the new token, as sketched below. – Miad Abrin Aug 29 '16 at 15:40
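
Condensing that last comment into one runnable sequence (a sketch; `worker1` is a placeholder for the real node name, and the token comes from the init output):

On the worker:

# docker swarm leave --force
# service docker restart

On the master:

# docker node rm worker1      # worker1 is a placeholder node name
# service docker restart
# docker swarm init --advertise-addr 10.1.1.40 --force-new-cluster

Back on the worker, using the token printed by the init above:

# docker swarm join --token <new-token> 10.1.1.40:2377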