19

I'm trying to connect a Mesos slave to its master. Whenever the slave tries to connect to the master, I get the following message:

I0806 16:39:59.090845   935 hierarchical.hpp:528] Added slave 20150806-163941-1027506442-5050-921-S3 (debian) with cpus(*):1; mem(*):1938; disk(*):3777; ports(*):[31000-32000] (allocated: )
E0806 16:39:59.091384   940 socket.hpp:107] Shutdown failed on fd=25: Transport endpoint is not connected [107]
I0806 16:39:59.091508   940 master.cpp:3395] Registered slave 20150806-163941-1027506442-5050-921-S3 at slave(1)@127.0.1.1:5051 (debian) with cpus(*):1; mem(*):1938; disk(*):3777; ports(*):[31000-32000]
I0806 16:39:59.091747   940 master.cpp:1006] Slave 20150806-163941-1027506442-5050-921-S3 at slave(1)@127.0.1.1:5051 (debian) disconnected
I0806 16:39:59.091868   940 master.cpp:2203] Disconnecting slave 20150806-163941-1027506442-5050-921-S3 at slave(1)@127.0.1.1:5051 (debian)
I0806 16:39:59.092031   940 master.cpp:2222] Deactivating slave 20150806-163941-1027506442-5050-921-S3 at slave(1)@127.0.1.1:5051 (debian)
I0806 16:39:59.092248   939 hierarchical.hpp:621] Slave 20150806-163941-1027506442-5050-921-S3 deactivated

The error seems to be:

E0806 16:39:59.091384 940 socket.hpp:107] Shutdown failed on fd=25: Transport endpoint is not connected [107]

The master was started using:

./mesos-master.sh --ip=10.129.62.61 --work_dir=~/Mesos/mesos-0.23.0/workdir/ --zk=zk://10.129.62.61:2181/mesos --quorum=1

And the slave using:

./mesos-slave.sh --master=zk://10.129.62.61:2181/mesos

If I run the slave on the same VM as the master, it works fine.

I couldn't find much information on the internet. I'm running two virtual machines (Debian 8.1) on VirtualBox 5. The host is Windows 7.

Edit 1:

The master and the slave both run on a dedicated VM.

Both VMs' networks are configured as bridged networks.

ifconfig from master:

eth0      Link encap:Ethernet  HWaddr 08:00:27:cc:6c:6e
          inet addr:10.129.62.61  Bcast:10.129.255.255  Mask:255.255.0.0
          inet6 addr: fe80::a00:27ff:fecc:6c6e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5335953 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1422428 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:595886271 (568.2 MiB)  TX bytes:362423868 (345.6 MiB)

ifconfig from slave:

eth0      Link encap:Ethernet  HWaddr 08:00:27:56:83:20
          inet addr:10.129.62.49  Bcast:10.129.255.255  Mask:255.255.0.0
          inet6 addr: fe80::a00:27ff:fe56:8320/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4358561 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3825 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:397126834 (378.7 MiB)  TX bytes:354116 (345.8 KiB)

Edit 2:

The slave logs can be found at http://pastebin.com/CXZUBHKr

The master logs can be found at http://pastebin.com/thYR1par

benjamin.d
  • Can you show the output of `ifconfig` on your slave? I suspect it registers with the master using the wrong IP – janisz Aug 10 '15 at 14:09
  • I edited the question – benjamin.d Aug 11 '15 at 07:38
  • I am suggesting this based on the [getting started document](http://mesos.apache.org/gettingstarted/). Instead of `./mesos-slave.sh --master=zk://10.129.62.61:2181/mesos`, can you try `./mesos-slave.sh --master=10.129.62.61:5050` and see if that works? The mesos-master process is listening on port 5050. – Dharmit Aug 11 '15 at 12:39
  • I get the same error – benjamin.d Aug 11 '15 at 13:05

4 Answers

12

I had a similar problem. My slave logs would be filled with

    E0812 15:58:04.017990  2193 socket.hpp:107] Shutdown failed on fd=13: Transport endpoint is not connected [107]

My master would have

    F0120 20:45:48.025610 12116 master.cpp:1083] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins

The master would then die and a new election would occur. The killed master would be restarted by upstart (I am on a CentOS 6 box) and added back into the pool of potential masters, so the elected master would daisy-chain around my master nodes. Restarting the masters and slaves many times did nothing; the problem consistently returned within 1 minute of a master election.

The solution for me came from a Stack Overflow question (thanks) and a hint in a GitHub gist note.

The gist of it is that /etc/default/mesos-master must specify a quorum number, and it must be correct for the number of Mesos masters (in my case 3 masters, so a quorum of 2):

    MESOS_QUORUM=2

This seems odd to me, as I have the same information in the file /etc/mesos-master/quorum.

But I added it to /etc/default/mesos-master, restarted the mesos-masters and slaves, and the problem has not returned.
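As a rough illustration of how the quorum relates to the number of masters: the quorum should be a strict majority of the masters, which is why 3 masters give a quorum of 2. The helper name `quorum_for` below is made up for this sketch and is not part of Mesos.

```shell
# Hypothetical helper (not a Mesos command): print the smallest strict
# majority for a given number of masters.
quorum_for() {
    masters=$1
    # Integer division: 3 masters -> 2, 5 masters -> 3, 1 master -> 1
    echo $(( masters / 2 + 1 ))
}

# With 3 masters, as in this answer:
quorum_for 3
```

The printed value is what you would put after `MESOS_QUORUM=` in /etc/default/mesos-master.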

I hope this helps you.

Community
  • To avoid duplication, you can set it to ``MESOS_QUORUM=`cat /etc/mesos-master/quorum` `` instead – kbolino Apr 02 '16 at 03:56
  • This really did the trick! Notably, one really needs to set `MESOS_QUORUM` instead of just `QUORUM`, as one might expect, because none of the other settings need a `MESOS_` prefix. Strange... Seems like a bug to me. – Tobi Apr 11 '16 at 13:59
  • Note: Slave disconnects can also be caused by an incorrect bind ip address setting in /etc/default/mesos. See: https://marc.info/?l=mesos-user&m=142539883727970&w=2 – Jay Taylor Jun 13 '16 at 01:22
2

I've run into this error in the logs when upgrading mesos versions (e.g. 0.20.0 -> 0.27.0). Sometimes the data from the previous version is incompatible with other versions.

Here is how I remedied it:

First ensure all nodes have the mesos-master service stopped:

    sudo service mesos-master stop

Then clear out all potential old data:

  1. Remove $MESOS_WORK_DIR (/var/mesos in my case):

    sudo rm -rf /var/mesos
    
  2. Clear out the Mesos data in ZooKeeper:

    $ zkCli.sh
    WatchedEvent state:SyncConnected type:None path:null
    [zk: localhost:2181(CONNECTED) 0] rmr /mesos
    [zk: localhost:2181(CONNECTED) 0] quit
    Quitting...
    

After doing these steps I started the mesos-master service on all nodes and it came back online.
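The steps above can be collected into a dry-run sketch that only prints the commands, so nothing is deleted by accident. The function name `print_reset_plan`, the `WORK_DIR` and `ZK_SERVER` variables, and the non-interactive `zkCli.sh -server ... rmr` form are assumptions for illustration; check them against your own installation before running anything for real.

```shell
# Dry run: prints the reset commands instead of executing them.
# WORK_DIR and ZK_SERVER are assumed defaults; adjust to your setup.
WORK_DIR=${WORK_DIR:-/var/mesos}
ZK_SERVER=${ZK_SERVER:-localhost:2181}

print_reset_plan() {
    echo "sudo service mesos-master stop"          # run on every master first
    echo "sudo rm -rf $WORK_DIR"                   # step 1: remove the work dir
    echo "zkCli.sh -server $ZK_SERVER rmr /mesos"  # step 2: clear the ZooKeeper node
    echo "sudo service mesos-master start"         # bring the masters back
}

print_reset_plan
```

Review the printed plan, then run the commands by hand on each node in that order.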

Tombart
Jay Taylor
  • This worked for me, but I just needed to delete `/data/tmp/mesos/replicated_log/` on all the masters, instead of the entire work dir, then also the zookeeper /mesos node. This is actually documented here: http://mesos.apache.org/documentation/latest/operational-guide/ (increasing the quorum size) – Vincenzo Pii Feb 08 '17 at 15:28
2
I0806 16:39:59.091747   940 master.cpp:1006] Slave 20150806-163941-1027506442-5050-921-S3 at slave(1)@127.0.1.1:5051 (debian) disconnected

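This line is the hint: the slave registers as 127.0.1.1:5051.

Your slave exposes the wrong IP. 127.0.1.1 is a loopback address, so the master cannot reach the slave from another VM.

Append --ip=10.129.62.49 to the slave command and it works.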
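If you would rather not hard-code the address, here is a minimal sketch that picks the first non-loopback address from a list. The helper name `first_non_loopback` is made up for this example, and feeding it `hostname -I` assumes that option is available on your system (it is on Debian).

```shell
# Hypothetical helper: print the first argument that is not a 127.x.x.x address.
first_non_loopback() {
    for addr in "$@"; do
        case "$addr" in
            127.*) ;;                        # skip loopback addresses
            *) echo "$addr"; return 0 ;;     # first usable address wins
        esac
    done
    return 1                                 # nothing suitable found
}

# Example with the two addresses from this question:
first_non_loopback 127.0.1.1 10.129.62.49

# On the slave itself (not run here):
#   ./mesos-slave.sh --master=zk://10.129.62.61:2181/mesos \
#       --ip="$(first_non_loopback $(hostname -I))"
```

With the addresses from the question, the example call prints 10.129.62.49, which is the value to pass to `--ip`.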

noob
0

Run the slave with --ip=10.129.62.49 instead

Matt
hartem