I set up a small test Mesosphere cluster following this guide: https://dcos.io/docs/1.8/administration/installing/custom/cli/ and everything went smoothly. There are only 3 nodes in the cluster: one bootstrap node, one master (10.7.1.12), and one agent (10.7.1.13).

But after rebooting the machine that hosts the agent node, the agent is no longer visible to the master node.

In /var/log/mesos/mesos-agent.log, the last entry has a timestamp from before the reboot. I tried all the steps from https://dcos.io/docs/1.8/administration/installing/custom/troubleshooting/ but nothing changed.

Here are the logs from the master after the agent disconnected (sudo journalctl -u dcos-mesos-master):

lut 06 15:48:14 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:48:14.556001  2671 master.cpp:1245] Agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 at slave(1)@10.7.1.13:5051 (10.7.1.13) disconnected
lut 06 15:48:14 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:48:14.556089  2671 master.cpp:2784] Disconnecting agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 at slave(1)@10.7.1.13:5051 (10.7.1.13)
lut 06 15:48:14 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:48:14.556170  2671 master.cpp:2803] Deactivating agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 at slave(1)@10.7.1.13:5051 (10.7.1.13)
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: W0206 15:53:16.926198  2670 master.cpp:5334] Shutting down agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 at slave(1)@10.7.1.13:5051 (10.7.1.13) with message 'health check timed out'
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.926230  2670 master.cpp:6617] Removing agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 at slave(1)@10.7.1.13:5051 (10.7.1.13): health check timed out
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.926507  2670 master.cpp:6910] Removing task 93f4b075-1338-4a84-afd6-6932cfe44c30 with resources mem(arangodb31, arangodb3):2048; cpus(arangodb31, arangodb3):0.25; disk(arangodb31, arangodb3)[AGENCY_991972e5-2d83-4710-ba3c-de8cf02303ab:myPersistentVolume]:2048; ports(arangodb31, arangodb3):[1026-1026] of framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0004 on agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 at slave(1)@10.7.1.13:5051 (10.7.1.13)
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.926695  2670 master.cpp:6910] Removing task 644b59eb-fb20-43fd-a7c1-b1d9406cbfcb with resources mem(arangodb3, arangodb3):2048; cpus(arangodb3, arangodb3):0.25; disk(arangodb3, arangodb3)[AGENCY_0c76702f-ae8b-423c-83a8-1b6e2af8b723:myPersistentVolume]:2048; ports(arangodb3, arangodb3):[1025-1025] of framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0002 on agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 at slave(1)@10.7.1.13:5051 (10.7.1.13)
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928460  2670 master.cpp:6736] Removed agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 (10.7.1.13): health check timed out
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928472  2670 master.cpp:5197] Sending status update TASK_LOST for task 93f4b075-1338-4a84-afd6-6932cfe44c30 of framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0004 'Slave 10.7.1.13 removed: health check timed out'
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: W0206 15:53:16.928486  2670 master.hpp:2113] Master attempted to send message to disconnected framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0004 (arangodb3-1) at scheduler-f4f3a3f0-2261-4aaf-9390-81f4b1cc6d20@10.7.1.13:25366
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928611  2670 master.cpp:5197] Sending status update TASK_LOST for task 644b59eb-fb20-43fd-a7c1-b1d9406cbfcb of framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0002 'Slave 10.7.1.13 removed: health check timed out'
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: W0206 15:53:16.928638  2670 master.hpp:2113] Master attempted to send message to disconnected framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0002 (arangodb3) at scheduler-7870582e-becd-4747-aeba-0217e91d537e@10.7.1.13:19866
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928747  2670 master.cpp:6759] Notifying framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0003 (arangodb3-standalone) at scheduler-180f6695-f3c9-4da6-80e8-d1dc633ec737@10.7.1.13:3583 of lost agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 (10.7.1.13) after recovering
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: W0206 15:53:16.928761  2670 master.hpp:2113] Master attempted to send message to disconnected framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0003 (arangodb3-standalone) at scheduler-180f6695-f3c9-4da6-80e8-d1dc633ec737@10.7.1.13:3583
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928894  2670 master.cpp:6759] Notifying framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0002 (arangodb3) at scheduler-7870582e-becd-4747-aeba-0217e91d537e@10.7.1.13:19866 of lost agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 (10.7.1.13) after recovering
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: W0206 15:53:16.928905  2670 master.hpp:2113] Master attempted to send message to disconnected framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0002 (arangodb3) at scheduler-7870582e-becd-4747-aeba-0217e91d537e@10.7.1.13:19866
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928921  2670 master.cpp:6759] Notifying framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0004 (arangodb3-1) at scheduler-f4f3a3f0-2261-4aaf-9390-81f4b1cc6d20@10.7.1.13:25366 of lost agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 (10.7.1.13) after recovering
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: W0206 15:53:16.928928  2670 master.hpp:2113] Master attempted to send message to disconnected framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0004 (arangodb3-1) at scheduler-f4f3a3f0-2261-4aaf-9390-81f4b1cc6d20@10.7.1.13:25366
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928941  2670 master.cpp:6759] Notifying framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0001 (metronome) at scheduler-4eb937a4-9a64-4a47-9245-3858defe691a@10.7.1.12:41077 of lost agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 (10.7.1.13) after recovering
lut 06 15:53:16 arangodb2.test1.fgtsa.com mesos-master[2661]: I0206 15:53:16.928963  2670 master.cpp:6759] Notifying framework d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-0000 (marathon) at scheduler-02bf4e29-4dd7-4cf8-b14b-4a064b4d082c@10.7.1.12:43643 of lost agent d44b38bf-f35b-4c47-8dfb-d53b543b5e8f-S0 (10.7.1.13) after recovering

The rest of the journals (`journalctl ...`) are empty.

There is also this error in the ZooKeeper logs: [screenshot]

I will be grateful for any suggestions on how to investigate it further.

EDIT:

I managed to bring the agent node up manually by starting the dcos-mesos-slave service (before that I had to start the dcos-spartan and dcos-gen-resolvconf services). Any ideas why it didn't start automatically?
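For anyone repeating the manual workaround, here is a minimal sketch of it in Python. The unit names and their order come from the edit above. The helper `start_in_order` and its `run` parameter are hypothetical names introduced only for illustration. The sketch assumes it runs as root on the agent node and simply shells out to `systemctl start`.

```python
import subprocess

# Order taken from the question: the two helper services first, then the agent.
UNITS = ["dcos-spartan", "dcos-gen-resolvconf", "dcos-mesos-slave"]


def start_in_order(units, run=subprocess.run):
    """Start each unit with `systemctl start`, stopping at the first failure.

    `run` is injectable so the logic can be exercised without systemd.
    Returns the list of units whose start command exited with status 0.
    """
    started = []
    for unit in units:
        result = run(["systemctl", "start", unit])
        if result.returncode != 0:
            # Do not try later units if an earlier one failed to start.
            break
        started.append(unit)
    return started
```

Calling `start_in_order(UNITS)` on the agent node would issue the same three `systemctl start` commands that were run by hand, in the same order.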

Purple

1 Answer

Any ideas why it didn't start automatically?

According to the rules for using systemd reliably, the systemd units do not depend on each other, so you need to start everything manually:

  • Requires=, Wants= are not allowed. If something that is depended upon fails, the thing depending on it will never try to be started again.
  • Before=, After= are discouraged. They are not strong guarantees; software needs to check that prerequisites are up and working correctly.
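The second rule says the software itself must check that its prerequisites are up. That could look roughly like the polling sketch below. This is an illustration only, not how DC/OS implements it: `wait_until_active` and its parameters are hypothetical, and it relies on `systemctl is-active --quiet`, which exits with status 0 when the unit is active.

```python
import subprocess
import time


def wait_until_active(unit, attempts=10, delay=3.0,
                      run=subprocess.run, sleep=time.sleep):
    """Poll `systemctl is-active --quiet <unit>` until it reports active.

    Returns True as soon as the unit is active, or False once `attempts`
    checks have all failed. `run` and `sleep` are injectable for testing.
    """
    for attempt in range(attempts):
        result = run(["systemctl", "is-active", "--quiet", unit])
        if result.returncode == 0:
            return True
        if attempt < attempts - 1:
            # Wait before the next check, but not after the final one.
            sleep(delay)
    return False
```

A service wrapper could call, for example, `wait_until_active("dcos-spartan")` before starting the agent, instead of relying on After= ordering alone.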
janisz