I am using Spark Job Server (SJS) to create contexts and submit jobs.
My cluster consists of 4 servers:
master1: 10.197.0.3
master2: 10.197.0.4
master3: 10.197.0.5
master4: 10.197.0.6
Only master1 has a public IP.
First, I set up ZooKeeper on master1, master2 and master3, with ZooKeeper ids 1 to 3.
I intend to use master1, master2 and master3 as the masters of the cluster.
That means quorum=2,
which is what I set on all 3 masters.
The zk connect string is zk://master1:2181,master2:2181,master3:2181/mesos
I also start mesos-slave on every server, so I have 4 slaves and 3 masters.
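To double-check the numbers above, here is a minimal sketch of how the quorum and the zk connect string follow from the list of masters. The helper names quorum_size and zk_url are made up for illustration; the hosts, port and /mesos path are the ones from my setup.

```python
# Hosts taken from the setup described above.
MASTERS = ["master1", "master2", "master3"]


def quorum_size(num_masters: int) -> int:
    """Smallest strict majority of the masters."""
    return num_masters // 2 + 1


def zk_url(hosts, port=2181, path="/mesos"):
    """Build the zk:// connect string shared by masters and slaves."""
    return "zk://" + ",".join(f"{h}:{port}" for h in hosts) + path


if __name__ == "__main__":
    print(quorum_size(len(MASTERS)))
    print(zk_url(MASTERS))
```

With 3 masters this gives a quorum of 2 and the same connect string I configured.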
As you can see, all slaves are connected.
The odd thing is that when I create a job, it cannot acquire any resources.
In the logs I see that the framework keeps DECLINING the offers. These logs are from the master:
I0523 15:01:00.116981 32513 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4264 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4@10.197.0.3:28765
I0523 15:01:00.117086 32513 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4265 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4@10.197.0.3:28765
I0523 15:01:01.460502 32508 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (914)@127.0.0.1:5050
I0523 15:01:02.117753 32510 master.cpp:5324] Sending 1 offers to framework dc18c89f-d802-404b-9221-71f0f15b096f-0000 (sql_context) at scheduler-9b4637cf-4b27-4629-9a73-6019443ed30b@10.197.0.3:28765
I0523 15:01:02.118099 32510 master.cpp:5324] Sending 1 offers to framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4@10.197.0.3:28765
I0523 15:01:02.119299 32508 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4266 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0000 (sql_context) at scheduler-9b4637cf-4b27-4629-9a73-6019443ed30b@10.197.0.3:28765
I0523 15:01:02.119858 32515 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4267 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4@10.197.0.3:28765
I0523 15:01:02.900946 32509 http.cpp:312] HTTP GET for /master/state from 10.197.0.3:35778 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36' with X-Forwarded-For='113.161.38.181'
I0523 15:01:03.118147 32514 master.cpp:5324] Sending 1 offers to framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4@10.197.0.3:28765
On one of my slaves I see:
W0523 14:53:15.487599 32681 status_update_manager.cpp:475] Resending status update TASK_FAILED (UUID: 3c3a022c-2032-4da1-bbab-c367d46e07de) for task driver-20160523111535-0003 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0019
W0523 14:53:15.487773 32681 status_update_manager.cpp:475] Resending status update TASK_FAILED (UUID: cfb494b3-6484-4394-bd94-80abf2e11ee8) for task driver-20160523112724-0001 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0020
I0523 14:53:15.487820 32680 slave.cpp:3400] Forwarding the update TASK_FAILED (UUID: 3c3a022c-2032-4da1-bbab-c367d46e07de) for task driver-20160523111535-0003 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0019 to master@10.197.0.3:5050
I0523 14:53:15.488008 32680 slave.cpp:3400] Forwarding the update TASK_FAILED (UUID: cfb494b3-6484-4394-bd94-80abf2e11ee8) for task driver-20160523112724-0001 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0020 to master@10.197.0.3:5050
I0523 15:02:24.120436 32680 http.cpp:190] HTTP GET for /slave(1)/state from 113.161.38.181:63097 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
W0523 15:02:24.165690 32685 slave.cpp:4979] Failed to get resource statistics for executor 'driver-20160523111535-0003' of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0019: Container 'cac7667c-3309-4380-9f95-07d9b888e44e' not found
W0523 15:02:24.165771 32685 slave.cpp:4979] Failed to get resource statistics for executor 'driver-20160523112724-0001' of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0020: Container '9c661311-bf7f-4ea6-9348-ce8c7f6cfbcb' not found
From the SJS logs:
[2016-05-23 15:04:10,305] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4565 with attributes: Map() mem: 63403.0 cpu: 8
[2016-05-23 15:04:10,305] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4566 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:10,305] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4567 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:10,366] WARN cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
[2016-05-23 15:04:10,505] DEBUG cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - parentName: , name: TaskSet_0, runningTasks: 0
[2016-05-23 15:04:11,306] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4568 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:11,306] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4569 with attributes: Map() mem: 63403.0 cpu: 8
[2016-05-23 15:04:11,505] DEBUG cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - parentName: , name: TaskSet_0, runningTasks: 0
[2016-05-23 15:04:12,308] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4570 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:12,505] DEBUG cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - parentName: , name: TaskSet_0, runningTasks: 0
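To see what is being turned down, here is a rough sketch that pulls the offer id, memory and CPU count out of the "Declining offer" lines and compares them with what a context asks for. The regular expression only assumes the line format shown above. The required values passed to fits() are placeholders, not my real settings, and the check is a simplification, not Spark's actual decline logic.

```python
import re

# Three lines copied from the SJS log above.
LOG = """\
[2016-05-23 15:04:10,305] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4565 with attributes: Map() mem: 63403.0 cpu: 8
[2016-05-23 15:04:10,305] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4566 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:10,305] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4567 with attributes: Map() mem: 47244.0 cpu: 8
"""

OFFER_RE = re.compile(
    r"Declining offer: (\S+) with attributes: .*? mem: ([\d.]+) cpu: (\d+)"
)


def declined_offers(text):
    """Return (offer_id, mem_mb, cpus) for every declined offer in the log text."""
    offers = []
    for line in text.splitlines():
        match = OFFER_RE.search(line)
        if match:
            offers.append((match.group(1), float(match.group(2)), int(match.group(3))))
    return offers


def fits(mem_mb, cpus, need_mem_mb, need_cpus):
    """Simplified size check only; other decline reasons are not modelled."""
    return mem_mb >= need_mem_mb and cpus >= need_cpus


if __name__ == "__main__":
    for offer_id, mem_mb, cpus in declined_offers(LOG):
        # 4096 MB and 2 cores are placeholder requirements.
        print(offer_id, mem_mb, cpus, fits(mem_mb, cpus, 4096, 2))
```

If the declined offers are larger than what the context requests, as they appear to be here, the size of the offers is probably not the reason for the declines.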
From the master2 logs:
May 23 08:19:44 ants-vps mesos-master[1866]: E0523 08:19:44.273349 1902 process.cpp:1958] Failed to shutdown socket with fd 28: Transport endpoint is not connected
May 23 08:19:54 ants-vps mesos-master[1866]: I0523 08:19:54.274245 1899 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (1257)@127.0.0.1:5050
May 23 08:19:54 ants-vps mesos-master[1866]: E0523 08:19:54.274533 1902 process.cpp:1958] Failed to shutdown socket with fd 28: Transport endpoint is not connected
May 23 08:20:04 ants-vps mesos-master[1866]: I0523 08:20:04.275291 1897 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (1260)@127.0.0.1:5050
May 23 08:20:04 ants-vps mesos-master[1866]: E0523 08:20:04.275512 1902 process.cpp:1958] Failed to shutdown socket with fd 28: Transport endpoint is not connected
From master3:
May 23 08:21:05 ants-vps mesos-master[22023]: I0523 08:21:05.994082 22042 recover.cpp:193] Received a recover response from a replica in EMPTY status
May 23 08:21:15 ants-vps mesos-master[22023]: I0523 08:21:15.994051 22043 recover.cpp:109] Unable to finish the recover protocol in 10secs, retrying
May 23 08:21:15 ants-vps mesos-master[22023]: I0523 08:21:15.994529 22036 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (1282)@127.0.0.1:5050
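One thing that stands out is that the recover requests on master2 and master3 come from an endpoint at 127.0.0.1:5050, while the slave log refers to master@10.197.0.3:5050. Below is a small sketch, assuming the logs are available as plain text lines, that lists the endpoints logged with a loopback address. I have not established whether this is related to the declined offers.

```python
import ipaddress
import re

# Two lines copied from the logs above: one from master3, one from the slave.
LINES = [
    "May 23 08:21:15 ants-vps mesos-master[22023]: I0523 08:21:15.994529 22036 replica.cpp:673] "
    "Replica in EMPTY status received a broadcasted recover request from (1282)@127.0.0.1:5050",
    "I0523 14:53:15.487820 32680 slave.cpp:3400] Forwarding the update TASK_FAILED "
    "(UUID: 3c3a022c-2032-4da1-bbab-c367d46e07de) for task driver-20160523111535-0003 of framework "
    "a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0019 to master@10.197.0.3:5050",
]

# Matches the "@ip:port" suffix of endpoints as they appear in these logs.
ENDPOINT_RE = re.compile(r"@(\d+\.\d+\.\d+\.\d+):(\d+)")


def loopback_endpoints(lines):
    """Return (ip, port) for every logged endpoint that uses a loopback address."""
    found = []
    for line in lines:
        for ip, port in ENDPOINT_RE.findall(line):
            if ipaddress.ip_address(ip).is_loopback:
                found.append((ip, int(port)))
    return found


if __name__ == "__main__":
    print(loopback_endpoints(LINES))
```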
How can I find the cause of these issues and fix them?