
I am using Spark Job Server (SJS) to create contexts and submit jobs.

My cluster consists of 4 servers:

master1: 10.197.0.3
master2: 10.197.0.4
master3: 10.197.0.5
master4: 10.197.0.6

Only master1 has a public IP.

First of all I set up ZooKeeper on master1, master2 and master3, with the zookeeper id set from 1 to 3. I intend to use master1, master2 and master3 as the masters of the cluster, so I set quorum=2 for the 3 masters. The ZooKeeper connect string is zk://master1:2181,master2:2181,master3:2181/mesos. On each of the 4 servers I also start mesos-slave, so I have 4 slaves and 3 masters (screenshot: My Cluster Resource).
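For reference, this is roughly what that setup looks like. The file paths assume the stock Debian/Ubuntu ZooKeeper and Mesos packages (other packages keep the myid file in ZooKeeper's dataDir instead), so treat them as assumptions rather than exact paths from my hosts:

# /etc/zookeeper/conf/zoo.cfg on master1, master2 and master3
server.1=10.197.0.3:2888:3888
server.2=10.197.0.4:2888:3888
server.3=10.197.0.5:2888:3888

# per-host id (1, 2 or 3), e.g. on master1:
echo 1 > /etc/zookeeper/conf/myid

# start each master against the ensemble with quorum=2
mesos-master --zk=zk://master1:2181,master2:2181,master3:2181/mesos \
  --quorum=2 --work_dir=/var/lib/mesos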

As you can see, all slaves are connected. But the odd thing is that when I create a job and run it, it cannot acquire any resources (screenshot: Spark Context SJS Created).
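The context is created through SJS's standard REST API, along these lines (assuming the default SJS port 8090; the resource numbers are the ones discussed in the comments below):

curl -d "" 'master1:8090/contexts/sql_context?num-cpu-cores=18&memory-per-node=50g'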

From the logs I can see that the framework keeps DECLINING the offers. These logs are from the master:

I0523 15:01:00.116981 32513 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4264 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4@10.197.0.3:28765
I0523 15:01:00.117086 32513 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4265 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4@10.197.0.3:28765
I0523 15:01:01.460502 32508 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (914)@127.0.0.1:5050
I0523 15:01:02.117753 32510 master.cpp:5324] Sending 1 offers to framework dc18c89f-d802-404b-9221-71f0f15b096f-0000 (sql_context) at scheduler-9b4637cf-4b27-4629-9a73-6019443ed30b@10.197.0.3:28765
I0523 15:01:02.118099 32510 master.cpp:5324] Sending 1 offers to framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4@10.197.0.3:28765
I0523 15:01:02.119299 32508 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4266 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0000 (sql_context) at scheduler-9b4637cf-4b27-4629-9a73-6019443ed30b@10.197.0.3:28765
I0523 15:01:02.119858 32515 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4267 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4@10.197.0.3:28765
I0523 15:01:02.900946 32509 http.cpp:312] HTTP GET for /master/state from 10.197.0.3:35778 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36' with X-Forwarded-For='113.161.38.181'
I0523 15:01:03.118147 32514 master.cpp:5324] Sending 1 offers to framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4@10.197.0.3:28765

On one of my slaves I checked the logs:

W0523 14:53:15.487599 32681 status_update_manager.cpp:475] Resending status update TASK_FAILED (UUID: 3c3a022c-2032-4da1-bbab-c367d46e07de) for task driver-20160523111535-0003 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0019
W0523 14:53:15.487773 32681 status_update_manager.cpp:475] Resending status update TASK_FAILED (UUID: cfb494b3-6484-4394-bd94-80abf2e11ee8) for task driver-20160523112724-0001 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0020
I0523 14:53:15.487820 32680 slave.cpp:3400] Forwarding the update TASK_FAILED (UUID: 3c3a022c-2032-4da1-bbab-c367d46e07de) for task driver-20160523111535-0003 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0019 to master@10.197.0.3:5050
I0523 14:53:15.488008 32680 slave.cpp:3400] Forwarding the update TASK_FAILED (UUID: cfb494b3-6484-4394-bd94-80abf2e11ee8) for task driver-20160523112724-0001 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0020 to master@10.197.0.3:5050
I0523 15:02:24.120436 32680 http.cpp:190] HTTP GET for /slave(1)/state from 113.161.38.181:63097 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
W0523 15:02:24.165690 32685 slave.cpp:4979] Failed to get resource statistics for executor 'driver-20160523111535-0003' of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0019: Container 'cac7667c-3309-4380-9f95-07d9b888e44e' not found
W0523 15:02:24.165771 32685 slave.cpp:4979] Failed to get resource statistics for executor 'driver-20160523112724-0001' of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0020: Container '9c661311-bf7f-4ea6-9348-ce8c7f6cfbcb' not found

From the SJS logs:

[2016-05-23 15:04:10,305] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4565 with attributes: Map() mem: 63403.0 cpu: 8
[2016-05-23 15:04:10,305] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4566 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:10,305] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4567 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:10,366] WARN  cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
[2016-05-23 15:04:10,505] DEBUG cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - parentName: , name: TaskSet_0, runningTasks: 0
[2016-05-23 15:04:11,306] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4568 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:11,306] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4569 with attributes: Map() mem: 63403.0 cpu: 8
[2016-05-23 15:04:11,505] DEBUG cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - parentName: , name: TaskSet_0, runningTasks: 0
[2016-05-23 15:04:12,308] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4570 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:12,505] DEBUG cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - parentName: , name: TaskSet_0, runningTasks: 0

From the master2 logs:

May 23 08:19:44 ants-vps mesos-master[1866]: E0523 08:19:44.273349  1902 process.cpp:1958] Failed to shutdown socket with fd 28: Transport endpoint is not connected
May 23 08:19:54 ants-vps mesos-master[1866]: I0523 08:19:54.274245  1899 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (1257)@127.0.0.1:5050
May 23 08:19:54 ants-vps mesos-master[1866]: E0523 08:19:54.274533  1902 process.cpp:1958] Failed to shutdown socket with fd 28: Transport endpoint is not connected
May 23 08:20:04 ants-vps mesos-master[1866]: I0523 08:20:04.275291  1897 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (1260)@127.0.0.1:5050
May 23 08:20:04 ants-vps mesos-master[1866]: E0523 08:20:04.275512  1902 process.cpp:1958] Failed to shutdown socket with fd 28: Transport endpoint is not connected

From master3:

May 23 08:21:05 ants-vps mesos-master[22023]: I0523 08:21:05.994082 22042 recover.cpp:193] Received a recover response from a replica in EMPTY status
May 23 08:21:15 ants-vps mesos-master[22023]: I0523 08:21:15.994051 22043 recover.cpp:109] Unable to finish the recover protocol in 10secs, retrying
May 23 08:21:15 ants-vps mesos-master[22023]: I0523 08:21:15.994529 22036 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (1282)@127.0.0.1:5050

How can I find the cause of these issues and fix them?

  • The tasks are being declined because you are asking for more resources than Mesos can offer. Try running the Spark job with the absolute minimum amount of resources possible, and kill the two frameworks you have running first. – Binary Nerd May 23 '16 at 08:25
  • As you can see I have 32 cores and ~200GB RAM. My context asks for just 18 cores and 50GB RAM. How can that not be enough resources? I have already killed and recreated it many times. – giaosudau May 23 '16 at 08:27
  • 1
    Right, i'm just suggesting you try and use a minimal amount of resource instead of all of it, which will help determine if the problem is related to how much resource you're requesting. – Binary Nerd May 23 '16 at 08:28
  • Yeah, I've created a context with 6 cores and 20GB memory and it is fine. But that means the other slaves were not connected, right? By the way, the other slaves are free; I am not running anything on those servers. – giaosudau May 23 '16 at 08:39
  • I tried using 20 cores and it was fine, but I don't know why I can't use all the resources. – giaosudau May 23 '16 at 09:13
  • 1
    Are you setting the IP addresses for master2 and master3? It seems they use the `127.0.0.1` address. Try setting the `LIBPROCESS_IP` or use the `--ip` flag when starting the masters. – Tobi May 23 '16 at 09:35
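Following Tobi's suggestion, a sketch of what that change might look like on master2. The `/etc/mesos-master/ip` file convention comes from the Mesosphere packages (each file in that directory is turned into a `--flag` by the init script), so whether it applies depends on how Mesos was installed:

# on master2 (10.197.0.4); repeat with 10.197.0.5 on master3
echo 10.197.0.4 > /etc/mesos-master/ip

# or export it for libprocess before starting the master
export LIBPROCESS_IP=10.197.0.4

# or pass the flag directly
mesos-master --zk=zk://master1:2181,master2:2181,master3:2181/mesos \
  --quorum=2 --ip=10.197.0.4 --work_dir=/var/lib/mesos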
