I'm trying to set up a simple Mesos cluster on 2 virtual machines. The IPs are:

  • 10.10.0.102 (with 1 master and 1 slave), FQDN mesos1.mydomain
  • 10.10.0.103 (with 1 slave), FQDN mesos2.mydomain

I'm using Mesos 0.27.1 (RPMs downloaded from Mesosphere) and CentOS Linux release 7.1.1503 (Core).

I was successful in deploying a 1-node cluster (10.10.0.102): the master and slave work, and I can deploy and scale a simple application via Marathon.

The problem comes when I try to start the second slave on 10.10.0.103. Whenever I start that slave, its state is deactivated.

Logs from slave on 10.10.0.103:

I0226 13:49:58.428019 14937 slave.cpp:463] Slave resources: cpus(*):1; mem(*):2768; disk(*):3409; ports(*):[31000-32000]
I0226 13:49:58.428019 14937 slave.cpp:471] Slave attributes: [  ]
I0226 13:49:58.428019 14937 slave.cpp:476] Slave hostname: mesos2
I0226 13:49:58.430469 14946 state.cpp:58] Recovering state from '/tmp/mesos/meta'
I0226 13:49:58.430922 14947 status_update_manager.cpp:200] Recovering status update manager
I0226 13:49:58.430954 14947 containerizer.cpp:390] Recovering containerizer
I0226 13:49:58.432219 14947 provisioner.cpp:245] Provisioner recovery complete
I0226 13:49:58.432273 14947 slave.cpp:4495] Finished recovery
I0226 13:49:58.448940 14948 group.cpp:349] Group process (group(1)@10.10.0.103:5051) connected to ZooKeeper
I0226 13:49:58.449050 14948 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0226 13:49:58.449064 14948 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
I0226 13:49:58.451846 14948 detector.cpp:154] Detected a new leader: (id='3')
I0226 13:49:58.451937 14948 group.cpp:700] Trying to get '/mesos/json.info_0000000003' in ZooKeeper
I0226 13:49:58.453397 14948 detector.cpp:479] A new leading master (UPID=master@10.10.0.102:5050) is detected
I0226 13:49:58.453459 14948 slave.cpp:795] New master detected at master@10.10.0.102:5050
I0226 13:49:58.453698 14948 slave.cpp:820] No credentials provided. Attempting to register without authentication
I0226 13:49:58.453724 14948 slave.cpp:831] Detecting new master
I0226 13:49:58.453743 14948 status_update_manager.cpp:174] Pausing sending status updates
I0226 13:50:58.445101 14948 slave.cpp:4304] Current disk usage 22.11%. Max allowed age: 4.752451232032847days
I0226 13:51:58.460233 14948 slave.cpp:4304] Current disk usage 22.11%. Max allowed age: 4.752451232032847days

Logs from master on 10.10.0.102:

I0226 22:55:14.240464  2021 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 682
I0226 22:55:14.240542  2021 hierarchical.cpp:473] Added slave a61e9d9f-f85b-4c72-9780-166a7ffc0ac3-S167 (mesos2) with cpus(*):1; mem(*):2768; disk(*):3409; ports(*):[31000-32000] (allocated: )
I0226 22:55:14.240671  2021 master.cpp:5350] Sending 1 offers to framework c5a5818d-16fa-42bf-8e73-697a2d12fe97-0001 (marathon) at scheduler-91034353-1820-4020-aad1-10e11d567136@10.10.0.102:45698
I0226 22:55:14.240767  2021 replica.cpp:537] Replica received write request for position 682 from (1259)@10.10.0.102:5050
E0226 22:55:14.241082  2027 process.cpp:1966] Failed to shutdown socket with fd 32: Transport endpoint is not connected
I0226 22:55:14.241143  2019 master.cpp:1172] Slave a61e9d9f-f85b-4c72-9780-166a7ffc0ac3-S167 at slave(1)@10.10.0.103:5051 (mesos2) disconnected
I0226 22:55:14.241153  2019 master.cpp:2633] Disconnecting slave a61e9d9f-f85b-4c72-9780-166a7ffc0ac3-S167 at slave(1)@10.10.0.103:5051 (mesos2)
I0226 22:55:14.241161  2019 master.cpp:2652] Deactivating slave a61e9d9f-f85b-4c72-9780-166a7ffc0ac3-S167 at slave(1)@10.10.0.103:5051 (mesos2)
I0226 22:55:14.241230  2019 hierarchical.cpp:560] Slave a61e9d9f-f85b-4c72-9780-166a7ffc0ac3-S167 deactivated
I0226 22:55:14.245923  2019 master.cpp:3673] Processing DECLINE call for offers: [ a61e9d9f-f85b-4c72-9780-166a7ffc0ac3-O1251 ] for framework c5a5818d-16fa-42bf-8e73-697a2d12fe97-0001 (marathon) at scheduler-91034353-1820-4020-aad1-10e11d567136@10.10.0.102:45698
W0226 22:55:14.245923  2019 master.cpp:3720] Ignoring decline of offer a61e9d9f-f85b-4c72-9780-166a7ffc0ac3-O1251 since it is no longer valid
I0226 22:55:14.249065  2021 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 8.264893ms
I0226 22:55:14.249107  2021 replica.cpp:712] Persisted action at 682
I0226 22:55:14.249220  2021 replica.cpp:691] Replica received learned notice for position 682 from @0.0.0.0:0

I've tried to start the slave using two approaches (on 10.10.0.103):

  • sudo service mesos-slave start
  • mesos-slave --master=10.10.0.102:5050 --ip=10.10.0.103

Both give me the same results.
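For the service-based start, the flags come from the stock Mesosphere package layout: as far as I understand, mesos-init-wrapper reads the ZooKeeper URL from /etc/mesos/zk and turns every file under /etc/mesos-slave/ into a command-line flag (file name = flag, file content = value). A rough sketch of that mechanism (the ip file here is only an illustration, equivalent to passing --ip=10.10.0.103 on the command line):

# /etc/mesos/zk - how the slave discovers the master (ZooKeeper runs on mesos1)
zk://10.10.0.102:2181/mesos

# /etc/mesos-slave/ip - optional flag file, same effect as --ip=10.10.0.103
10.10.0.103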

Additionally, in mesos-slave.WARNING I also see:

Running on machine: mesos2.mydomain
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W0226 13:49:58.415089 14937 systemd.cpp:244] Required functionality `Delegate` was introduced in Version `218`. Your system may not function properly; however since some distributions have patched systemd packages, your system may still be functional. This is why we keep running. See MESOS-3352 for more information

Based on similar topics, I see that this can be related to network configuration, so below is some info about it.

hosts file on 10.10.0.102

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.10.0.103 mesos2 mesos2.mydomain
10.10.0.102 mesos1 mesos1.mydomain

hosts file on 10.10.0.103

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.10.0.102 mesos1 mesos1.mydomain
10.10.0.103 mesos2 mesos2.mydomain
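A quick sanity check that both names resolve to the LAN addresses (and not to loopback or the NAT interface) can be run on each node with standard tools, nothing Mesos-specific:

# forward resolution should return the 10.10.0.x addresses from /etc/hosts
getent hosts mesos1.mydomain mesos2.mydomain

# the local FQDN and the address it maps to
hostname -f
hostname -i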

Both VMs have 2 network interfaces (not counting loopback). The output below comes from 10.10.0.103; 10.10.0.102 is similar:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:49:76:48 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3
       valid_lft 75232sec preferred_lft 75232sec
    inet6 fe80::a00:27ff:fe49:7648/64 scope link
       valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:d9:24:2a brd ff:ff:ff:ff:ff:ff
    inet 10.10.0.103/24 brd 10.10.0.255 scope global enp0s8
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fed9:242a/64 scope link
       valid_lft forever preferred_lft forever

Both VMs have network connectivity.

from 10.10.0.102 to 10.10.0.103

[root@mesos1 ~]# ping mesos2.mydomain
PING mesos2 (10.10.0.103) 56(84) bytes of data.
64 bytes from mesos2 (10.10.0.103): icmp_seq=1 ttl=64 time=0.578 ms
64 bytes from mesos2 (10.10.0.103): icmp_seq=2 ttl=64 time=0.616 ms

from 10.10.0.103 to 10.10.0.102

[root@mesos2 ~]# ping mesos1.mydomain
PING mesos1 (10.10.0.102) 56(84) bytes of data.
64 bytes from mesos1 (10.10.0.102): icmp_seq=1 ttl=64 time=0.441 ms
64 bytes from mesos1 (10.10.0.102): icmp_seq=2 ttl=64 time=0.972 ms
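Ping only proves ICMP works, so the actual Mesos ports could be checked as well. A minimal sketch with curl (both daemons serve HTTP on their libprocess port, 5050 for the master and 5051 for the slave; any HTTP response at all means the TCP connection worked, while a firewall drop shows up as a timeout):

# from 10.10.0.103: is the master's port 5050 reachable?
curl -v http://10.10.0.102:5050/

# from 10.10.0.102: is the slave's port 5051 reachable? (the master also needs
# to connect back to the slave on this port)
curl -v http://10.10.0.103:5051/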

Any help would be highly appreciated. Regards,
Andrzej

  • What is the ZooKeeper configuration of your Mesos master? And of the second slave, respectively? Have a look at the systemd unit of the `mesos-slave` service... – Tobi Mar 01 '16 at 08:04
  • @Tobi for mesos master: `[Unit] Description=Mesos Master After=network.target Wants=network.target [Service] ExecStart=/usr/bin/mesos-init-wrapper master Restart=always RestartSec=20 LimitNOFILE=16384 [Install] WantedBy=multi-user.target ` for second-slave (that can't connect): `[Unit] Description=Mesos Slave After=network.target Wants=network.target [Service] ExecStart=/usr/bin/mesos-init-wrapper slave KillMode=process Restart=always RestartSec=20 LimitNOFILE=16384 CPUAccounting=true MemoryAccounting=true [Install] WantedBy=multi-user.target ` – awenclaw Mar 02 '16 at 14:05
  • not sure if this will be helpful, but additionally: cat /etc/mesos/zk gives `zk://10.10.0.102:2181/mesos` – awenclaw Mar 02 '16 at 14:17
  • I hope this post helps you: http://stackoverflow.com/questions/31858937/transport-endpoint-not-connected-mesos-slave-master – kovit nisar Mar 03 '16 at 04:53
  • @kovitnisar I just tried that. I think his problem is different. The first response is about a multi-master environment; I have one master. And the error is different (not sure if this matters): mine: `Failed to shutdown socket with fd 37: Transport endpoint is not connected`, his: `Shutdown failed on fd=13: Transport endpoint is not connected`. The second answer is about migrating between versions. Anyway I did: 1.) vi /etc/default/mesos-master -> MESOS_QUORUM=1, 2.) cleaned the Mesos default working dir: rm -rf /tmp/mesos/, 3.) cleaned the ZooKeeper state: ./zkCli.sh -> rmr /mesos. Unfortunately I received the same error. – awenclaw Mar 03 '16 at 10:58
  • Did you restart zk after cleaning? Do it in the following order: `stop mesos-slave → stop mesos-master → stop zk → start zk → start mesos-master → start mesos-slave` – janisz Mar 03 '16 at 12:06
  • @janisz yes I did. Please see below the exact command order on both nodes (master & slave): **master-node:** `service marathon stop ; service chronos stop ; service mesos-slave stop ; service mesos-master stop ; service zookeeper-server stop` **slave-node:** `service mesos-slave stop` **master-node:** `rm -rf /tmp/mesos/ ; service zookeeper-server start ; /usr/lib/zookeeper/bin/zkCli.sh -> rmr /mesos -> quit ; service mesos-master start ; service mesos-slave start` **slave-node:** `service mesos-slave start` After executing all of that, the second slave is deactivated again. – awenclaw Mar 03 '16 at 13:46

1 Answer


As always, the simplest answers are the best. It turned out that I had a firewall (firewalld/iptables) running on the slave node. Disabling it resolved my problem:

systemctl disable firewalld
systemctl stop firewalld
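If disabling the firewall entirely is too heavy-handed, opening just the ports Mesos needs should also work (a sketch I have not verified here; 5051 is the slave port, 5050 the master, 2181 ZooKeeper):

firewall-cmd --permanent --add-port=5051/tcp   # on mesos2, for the slave
firewall-cmd --permanent --add-port=5050/tcp   # on mesos1, if firewalld runs there
firewall-cmd --permanent --add-port=2181/tcp   # on mesos1, for ZooKeeper
firewall-cmd --reload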

Thanks everyone for help!
