I am seeing the same problem as described here and here. I have tried everything that worked in those two cases, but I still see the same behavior. Can someone suggest alternatives I might try?
My setup:
I am running three CentOS 7.2 machines, with the Network Time Protocol daemon (ntpd) running on all of them. All have been updated with yum. Here is some detailed info:
Linux version 3.10.0-327.28.2.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) )
Docker version:
# docker version
Client:
Version: 1.12.1
API version: 1.24
Go version: go1.6.3
Git commit: 23cf638
Built:
OS/Arch: linux/amd64
Server:
Version: 1.12.1
API version: 1.24
Go version: go1.6.3
Git commit: 23cf638
Built:
OS/Arch: linux/amd64
Set up the swarm manager:
>docker swarm init --advertise-addr 10.1.1.40:2377 --force-new-cluster
// on some retry attempts (after 'docker swarm leave --force') I ran:
>docker swarm init --advertise-addr 10.1.1.40:2377 --force-new-cluster
Manager status:
>docker node inspect self
[
{
"ID": "3x5q1n9v956g3ptdle2eve856",
"Version": {
"Index": 10
},
"CreatedAt": "2016-08-27T13:01:13.400345797Z",
"UpdatedAt": "2016-08-27T13:01:13.580143388Z",
"Spec": {
"Role": "manager",
"Availability": "active"
},
"Description": {
"Hostname": "mymanagerhost.mycompany.com",
"Platform": {
"Architecture": "x86_64",
"OS": "linux"
},
"Resources": {
"NanoCPUs": 4000000000,
"MemoryBytes": 16659128320
},
"Engine": {
"EngineVersion": "1.12.1",
"Plugins": [
{
"Type": "Network",
"Name": "bridge"
},
{
"Type": "Network",
"Name": "host"
},
{
"Type": "Network",
"Name": "null"
},
{
"Type": "Network",
"Name": "overlay"
},
{
"Type": "Volume",
"Name": "local"
}
]
}
},
"Status": {
"State": "ready"
},
"ManagerStatus": {
"Leader": true,
"Reachability": "reachable",
"Addr": "10.1.1.40:2377"
}
}
]
On the worker node (I have two, but they both behave the same).
Join Swarm:
>docker swarm join --token SWMTKN-1-4fjh7kncdpwjvxnxisamhldgenmmnqyvhnx9qdi8d4hkkfuacv-168gs9okd5ck0r4lokdgpef92 10.1.1.40:2377
Error response from daemon: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use "docker info" command to see the current swarm status of your node.
Output of Docker info command:
>docker info
Plugins:
Volume: local
Network: null host bridge overlay
Swarm: pending
NodeID:
Error: rpc error: code = 1 desc = context canceled
Is Manager: false
Node Address: 10.1.1.50
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.28.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.52 GiB
Name: myWorkerNode.mycompany.com
ID: DAWE:VDRA:ZUVS:P7PH:ADCP:MFNU:2LOS:C6TG:XSIS:Y7EX:I46S:KFXT
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
127.0.0.0/8
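To watch whether the background join ever completes, the Swarm line can be pulled out of the docker info text. This is only a minimal sketch, assuming the output keeps the "Swarm: <state>" layout shown above; swarm_state is a hypothetical helper name, not a Docker command.

```shell
# Read `docker info` text on stdin and print the value of its "Swarm:" line.
# Prints nothing if no such line is present.
swarm_state() {
    awk -F': *' '/^[[:space:]]*Swarm:/ { print $2; exit }'
}

# Example, run on the worker:
#   docker info 2>/dev/null | swarm_state
```

On the worker above this would print "pending"; a node that has joined should report a different state.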
Edit per first answer below
I tried leaving the swarm and restarting the Docker service before re-initializing. On the manager I ran:
# docker swarm leave --force
Node left the swarm.
# service docker stop
Redirecting to /bin/systemctl stop docker.service
#
# service docker start
Redirecting to /bin/systemctl start docker.service
# docker swarm init --advertise-addr 10.1.1.40:2377
Swarm initialized: current node (0e0y2k2hngnwyeg86ilzbrjmu) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-2ggj60tnbppgjlg63a58oe5pqtv0vfrpj81hheawanf76x7cjc-7v48qak22wd03y3jyv903a9if \
10.1.1.40:2377
Then on the worker I did:
# docker swarm leave
Node left the swarm.
# service docker stop
Redirecting to /bin/systemctl stop docker.service
# service docker start
Redirecting to /bin/systemctl start docker.service
# docker swarm join \
> --token SWMTKN-1-2ggj60tnbppgjlg63a58oe5pqtv0vfrpj81hheawanf76x7cjc-7v48qak22wd03y3jyv903a9if \
> 10.1.1.40:2377
Error response from daemon: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use "docker info" command to see the current swarm status of your node.
This is the same behavior as before.
UPDATE
I have tried all the steps outlined by @Miad Abrin and still get the same behavior. I suspect the cause is related to the certificate warnings I see when I run:
# journalctl -xe
Aug 29 12:26:15 dockerd[6577]: time="2016-08-29T12:26:15.554904435-04:00" level=warning msg="failed to retrieve remote root CA certificate: rpc
Aug 29 12:26:15 dockerd[6577]: time="2016-08-29T12:26:15.555400400-04:00" level=warning msg="failed to retrieve remote root CA certificate: rpc
Aug 29 12:26:15 dockerd[6577]: time="2016-08-29T12:26:15.555478782-04:00" level=warning msg="failed to retrieve remote root CA certificate: rpc
Aug 29 12:26:15 dockerd[6577]: time="2016-08-29T12:26:15.555528929-04:00" level=warning msg="failed to retrieve remote root CA certificate: rpc
Aug 29 12:26:15 dockerd[6577]: time="2016-08-29T12:26:15.555685464-04:00" level=warning msg="failed to retrieve remote root CA certificate: rpc
Does anyone know the cause of this and how to correct it?
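Since the join times out and the worker log shows it failing to retrieve the manager's root CA certificate, one thing worth ruling out is whether the worker can open a TCP connection to the manager's swarm port (10.1.1.40:2377) at all. The following is a minimal sketch, assuming bash and the coreutils timeout command are available on the worker; port_status is a hypothetical helper name, not a Docker command.

```shell
# Report whether a TCP port accepts connections, using bash's /dev/tcp.
# Usage: port_status HOST PORT   (prints "open" or "closed")
port_status() {
    # Inside the child bash, $0 is the host and $1 is the port.
    if timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null; then
        echo open
    else
        echo closed
    fi
}

# Example, run on the worker (manager address taken from the commands above):
#   port_status 10.1.1.40 2377
```

A "closed" result here would point at a firewall or routing problem between the machines rather than at Docker itself. This only covers TCP, so it says nothing about any UDP traffic the swarm may need.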