Apache Flink Resource Manager (App Master) fails allocating new Task Managers after the first one is killed after idle timeout

Question

I am using Apache Flink 1.9.1 on Apache Mesos 1.8.1 using more or less standard settings for both on a 3 nodes cluster. The first node (192.168.10.11) runs the Mesos master and the Flink App Master. Both other nodes (192.168.10.13, 192.168.10.14) are somewhat slave nodes running mesos agents and having access to a Flink Worker docker container. Flink is run from within Docker but Mesos Master and agents runs natively on host.

Everything works fine at first. The first batch job I submit to Flink is properly taken a the App Master, The Resource Manager negotiates successfully with Mesos and the Resource Manager Framework gets a new worker task dispatched on one of the slave nodes. The task completes successfully and the client program gets its results.

After a few minutes, the ResourceManager timeout occurs and the Flink Worker Task (Docker container) is killed due to idle timeout. So far so good.

But after that moment, If I submit again the very same job in the very same way than the first time, The Resource Manager does nothing and doesn't attempt to create any new worker. The job stays waiting for 5 minutes and fails after timeout.

If someone could point me to what may be going on wrong, I would be pretty happy.

Below an extract of the relevant logs of the Flink App Master and mesos master:

The startup of the Flink App Master

2019-10-24 06:36:05,191 INFO  ....entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2019-10-24 06:36:05,195 INFO  ....entrypoint.ClusterEntrypoint         -  Starting MesosSessionClusterEntrypoint (Version: 1.9.1, Rev:4d56de8, Date:30.09.2019 @ 11:32:19 CST)
2019-10-24 06:36:05,195 INFO  ....entrypoint.ClusterEntrypoint         -  OS current user: flink
2019-10-24 06:36:07,045 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  Current Hadoop/Kerberos user: flink
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.222-b10
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  Maximum heap size: 2169 MiBytes
2019-10-24 06:36:07,504 INFO  ....entrypoint.ClusterEntrypoint         -  JAVA_HOME: (not set)
2019-10-24 06:36:07,519 INFO  ....entrypoint.ClusterEntrypoint         -  Hadoop version: 2.8.3
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -  JVM Options:
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -     -Dlog.file=/usr/local/lib/flink-1.9.1/log/flink--mesos-appmaster-deb-node1.log
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -     -Dlog4j.configuration=file:/usr/local/lib/flink-1.9.1/conf/log4j.properties
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -     -Dlogback.configurationFile=file:/usr/local/lib/flink-1.9.1/conf/logback.xml
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         -  Program Arguments: (none)
(...)
2019-10-24 06:36:07,520 INFO  ....entrypoint.ClusterEntrypoint         - --------------------------------------------------------------------------------
2019-10-24 06:36:07,526 INFO  ....entrypoint.ClusterEntrypoint         - Registered UNIX signal handlers for [TERM, HUP, INT]
2019-10-24 06:36:07,574 INFO  ....configuration.GlobalConfiguration            - Loading configuration property
----
properties summarized :
----
jobmanager.rpc.address, 192.168.10.11
jobmanager.rpc.port, 6123
jobmanager.heap.size, 1024m
taskmanager.heap.size, 1024m
taskmanager.numberOfTaskSlots, 2
parallelism.default, 2
state.checkpoints.dir, file:///var/lib/flink/data/ckeckpoints
state.savepoints.dir, file:///var/lib/flink/data/savepoints
jobmanager.execution.failover-strategy, region
rest.port, 8081
rest.bind-address, 0.0.0.0
jobmanager.archive.fs.dir, file:///var/lib/flink/completed_jobs/
historyserver.archive.fs.dir, file:///var/lib/flink/completed_jobs/
mesos.resourcemanager.tasks.container.type, docker
mesos.resourcemanager.tasks.container.image.name, eskimo:flink-worker
mesos.resourcemanager.tasks.bootstrap-cmd, export FLINK_HOME=/usr/local/lib/flink/
mesos.resourcemanager.tasks.container.volumes, /var/log/flink:/var/log/flink:RW,/var/lib/flink:/var/lib/flink:RW,/etc:/host_etc:RO
mesos.resourcemanager.tasks.cpus, 2
mesos.resourcemanager.tasks.mem, 1024
mesos.master, zk://192.168.10.11:2181/mesos
----
2019-10-24 06:36:07,753 INFO  ....entrypoint.ClusterEntrypoint         - Starting MesosSessionClusterEntrypoint.
(...)
2019-10-24 06:36:09,393 INFO  ....runtime.rpc.akka.AkkaRpcServiceUtils         - Trying to start actor system at 192.168.10.11:6123
(...)
(bla bla bla akka starting and all the rest)
(...)
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  - Registering as new framework.
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  - --------------------------------------------------------------------------------
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -  Mesos Info:
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Master URL: zk://192.168.10.11:2181/mesos
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -  Framework Info:
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     ID: (none)
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Name: Flink
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Failover Timeout (secs): 604800.0
2019-10-24 06:36:14,842 INFO  ....MesosResourceManager  -     Role: *
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Capabilities: (none)
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Principal: (none)
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Host: 192.168.10.11
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  -     Web UI: http://192.168.10.11:8081
2019-10-24 06:36:14,843 INFO  ....MesosResourceManager  - --------------------------------------------------------------------------------
2019-10-24 06:36:14,877 INFO  ....runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@192.168.10.11:6123/user/dispatcher was granted leadership with fencing token 00000000-0000-0000-0000-000000000000
(...)
2019-10-24 06:36:15,714 INFO  ....mesos.scheduler.ConnectionMonitor            - Connecting to Mesos...
2019-10-24 06:36:15,764 INFO  ....MesosResourceManager  - Mesos resource manager started.
(...)
2019-10-24 06:36:46,033 INFO  ....mesos.scheduler.ConnectionMonitor            - Connected to Mesos as framework ID e1432c43-f540-4d53-9cc7-b06e0f2a6525-0000.

The logs in mesos-master right after start of Flink App Master

I1024 06:36:52.096225    19 master.cpp:2920] Received SUBSCRIBE call for framework 'Flink' at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:36:52.097496    25 master.cpp:3001] Subscribing framework Flink with checkpointing enabled and capabilities [  ]
I1024 06:36:52.098603    25 master.cpp:10463] Adding framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411 with roles {  } suppressed
I1024 06:36:52.099602    18 hierarchical.cpp:373] Added framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:36:52.101519    18 master.cpp:10048] Sending offers [ 8977e508-0698-4c40-9317-cdc3cc8982f4-O1, 8977e508-0698-4c40-9317-cdc3cc8982f4-O2 ] to framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:36:52.185779    20 master.cpp:3438] Processing SUPPRESS call for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:36:52.186064    25 hierarchical.cpp:1346] Suppressed offers for roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000

Logs in the Flink App Master at the first job submission (the one working fine)

2019-10-24 06:37:18,800 INFO  ....runtime.dispatcher.StandaloneDispatcher      - Received JobGraph submission 50934ae259b85a449e9a789f82287305 (Flink Java Job at Thu Oct 24 06:37:17 UTC 2019).
2019-10-24 06:37:18,804 INFO  ....runtime.dispatcher.StandaloneDispatcher      - Submitting job 50934ae259b85a449e9a789f82287305 (Flink Java Job at Thu Oct 24 06:37:17 UTC 2019).
2019-10-24 06:37:18,889 INFO  ....runtime.rpc.akka.AkkaRpcService     - Starting RPC endpoint for ....JobMaster at akka://flink/user/jobmanager_0 .
2019-10-24 06:37:18,933 INFO  ....JobMaster                           - Initializing job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305).
(...)
2019-10-24 06:37:19,160 INFO  ....jobmaster.JobManagerRunner          - JobManager runner for job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) was granted leadership with session id 00000000-0000-0000-0000-000000000000 at akka.tcp://flink@192.168.10.11:6123/user/jobmanager_0.
2019-10-24 06:37:19,175 INFO  ....JobMaster                           - Starting execution of job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) under job master id 00000000000000000000000000000000.
2019-10-24 06:37:19,177 INFO  ....ExecutionGraph                      - Job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) switched from state CREATED to RUNNING.
2019-10-24 06:37:19,191 INFO  ....ExecutionGraph                      - DataSource (at $line107.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (....api.java.io.CollectionInputFormat)) (1/1) (842c8e8d1e6b91746b428331ab8712d3) switched from CREATED to SCHEDULED.
2019-10-24 06:37:19,229 INFO  ....SlotPoolImpl                        - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{4c886029d8d6a410cde790ba6a7ac3c0}]
2019-10-24 06:37:19,265 INFO  ....JobMaster                           - Connecting to ResourceManager akka.tcp://flink@192.168.10.11:6123/user/resourcemanager(00000000000000000000000000000000)
2019-10-24 06:37:19,272 INFO  ....JobMaster                           - Resolved ResourceManager address, beginning registration
2019-10-24 06:37:19,273 INFO  ....JobMaster                           - Registration at ResourceManager attempt 1 (timeout=100ms)
2019-10-24 06:37:19,284 INFO  ....MesosResourceManager                - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_0 for job 50934ae259b85a449e9a789f82287305.
2019-10-24 06:37:19,296 INFO  (...)
2019-10-24 06:37:19,305 INFO  ....JobMaster                           - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2019-10-24 06:37:19,306 INFO  ....SlotPoolImpl                        - Requesting new slot [SlotRequestId{4c886029d8d6a410cde790ba6a7ac3c0}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} from resource manager.
2019-10-24 06:37:19,311 INFO  ....MesosResourceManager                - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} for job 50934ae259b85a449e9a789f82287305 with allocation id 907586ba9ece54f20c23341276e6bac8.
2019-10-24 06:37:19,313 INFO  ....MesosResourceManager                - Starting a new worker.
2019-10-24 06:37:19,458 INFO  ....MesosResourceManager                - Scheduling Mesos task taskmanager-00001 with (1024.0 MB, 2.0 cpus).
2019-10-24 06:37:19,473 INFO  ....mesos.scheduler.LaunchCoordinator   - Now gathering offers for at least 1 task(s).
2019-10-24 06:37:19,495 INFO  ....mesos.scheduler.LaunchCoordinator   - Received offer(s) of 9432.0 MB, 4.0 cpus, 0.0 gpus:
2019-10-24 06:37:19,499 INFO  ....mesos.scheduler.LaunchCoordinator   -   8977e508-0698-4c40-9317-cdc3cc8982f4-O3 from 192.168.10.13 of 4639.0 MB, 2.0 cpus, 0.0 gpus for [*]
2019-10-24 06:37:19,500 INFO  ....mesos.scheduler.LaunchCoordinator   -   8977e508-0698-4c40-9317-cdc3cc8982f4-O4 from 192.168.10.14 of 4793.0 MB, 2.0 cpus, 0.0 gpus for [*]
2019-10-24 06:37:20,517 INFO  ....mesos.scheduler.LaunchCoordinator   - Processing 1 task(s) against 2 new offer(s) plus outstanding offers.
2019-10-24 06:37:20,578 INFO  ....mesos.scheduler.LaunchCoordinator   - Resources considered: (note: expired offers not deducted from below)
2019-10-24 06:37:20,587 INFO  ....mesos.scheduler.LaunchCoordinator   -   192.168.10.13 has 4639.0 MB, 2.0 cpus, null gpus
2019-10-24 06:37:20,841 INFO  ....MesosResourceManager                - Launching Mesos task taskmanager-00001 on host 192.168.10.14.
2019-10-24 06:37:20,841 INFO  ....mesos.scheduler.LaunchCoordinator   - Launched 1 task(s) on 192.168.10.14 using 1 offer(s):
2019-10-24 06:37:20,842 INFO  ....mesos.scheduler.LaunchCoordinator   -   8977e508-0698-4c40-9317-cdc3cc8982f4-O4
2019-10-24 06:37:20,843 INFO  ....mesos.scheduler.LaunchCoordinator   - No longer gathering offers; all requests fulfilled.
2019-10-24 06:37:20,844 INFO  com.netflix.fenzo.TaskScheduler         - Expiring all leases
2019-10-24 06:37:20,846 INFO  ....mesos.scheduler.LaunchCoordinator   - Declined offer 8977e508-0698-4c40-9317-cdc3cc8982f4-O3 from 192.168.10.13 of 4639.0 MB, 2.0 cpus.
2019-10-24 06:37:26,251 INFO  ....mesos.scheduler.TaskMonitor         - Mesos task taskmanager-00001 is running.
2019-10-24 06:37:29,575 INFO  ....MesosResourceManager                - Registering TaskManager with ResourceID taskmanager-00001 (akka.tcp://flink@192.168.10.14:31001/user/taskmanager_0) at ResourceManager
2019-10-24 06:37:29,693 INFO  ....ExecutionGraph                      - DataSource (at $line107.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (....api.java.io.CollectionInputFormat)) (1/1) (842c8e8d1e6b91746b428331ab8712d3) switched from SCHEDULED to DEPLOYING.
(...)
(everything going on fine)
(...)
2019-10-24 06:37:30,041 INFO  ....ExecutionGraph             - Job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) switched from state RUNNING to FINISHED.
2019-10-24 06:37:30,076 INFO  ....StandaloneDispatche r      - Job 50934ae259b85a449e9a789f82287305 reached globally terminal state FINISHED.
2019-10-24 06:37:30,205 INFO  ....FsJobArchivist             - Job 50934ae259b85a449e9a789f82287305 has been archived at file:/var/lib/flink/completed_jobs/50934ae259b85a449e9a789f82287305.
2019-10-24 06:37:30,206 INFO  ....JobMaster                  - Stopping the JobMaster for job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019(50934ae259b85a449e9a789f82287305).
2019-10-24 06:37:30,226 INFO  ....SlotPoolImpl               - Suspending SlotPool.
2019-10-24 06:37:30,227 INFO  ....JobMaster                  - Close ResourceManager connection 9c3fc959e219c30fba3395bb7be9823c: JobManager is shutting down..
2019-10-24 06:37:30,227 INFO  ....SlotPoolImpl               - Stopping SlotPool.
2019-10-24 06:37:30,232 INFO  ....MesosResourceManager  - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_0 for job 50934ae259b85a449e9a789f82287305 from the resource manager.
2019-10-24 06:37:30,232 INFO  ....jobmaster.JobManagerRunner  - JobManagerRunner already shutdown.

The corresponding time in mesos:

I1024 06:37:19.475060    23 hierarchical.cpp:1387] Unsuppressed offers for roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:19.475097    23 hierarchical.cpp:1409] Revived roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:19.476991    19 master.cpp:10048] Sending offers [ 8977e508-0698-4c40-9317-cdc3cc8982f4-O3, 8977e508-0698-4c40-9317-cdc3cc8982f4-O4 ] to framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:37:20.845321    19 master.cpp:3438] Processing SUPPRESS call for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
(...)
I1024 06:37:20.856938    23 master.cpp:4590] Processing ACCEPT call for offers: [ 8977e508-0698-4c40-9317-cdc3cc8982f4-O4 ] on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14) for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:37:20.860610    20 master.cpp:4151] Adding task taskmanager-00001 with resources cpus(allocated: *):2; mem(allocated: *):1024; ports(allocated: *):[31000-31001] of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411 on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:37:20.861378    20 master.cpp:5569] Launching task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411 with resources [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":2.0},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"ports","ranges":{"range":[{"begin":31000,"end":31001}]},"type":"RANGES"}] on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14) on  new executor
I1024 06:37:22.788302    24 master.cpp:8703] Status update TASK_STARTING (Status UUID: 240ff22e-d729-469c-a96d-aed5b19c0101) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 from agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:37:22.788444    24 master.cpp:8760] Forwarding status update TASK_STARTING (Status UUID: 240ff22e-d729-469c-a96d-aed5b19c0101) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:22.790689    24 master.cpp:11672] Updating the state of task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (latest state: TASK_STARTING, status update state: TASK_STARTING)
(...)
I1024 06:37:26.241962    23 master.cpp:8703] Status update TASK_RUNNING (Status UUID: d96cf8c7-5d22-4dfb-9ca5-349a847b5806) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 from agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:37:26.242094    23 master.cpp:8760] Forwarding status update TASK_RUNNING (Status UUID: d96cf8c7-5d22-4dfb-9ca5-349a847b5806) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:26.243048    23 master.cpp:11672] Updating the state of task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (latest state: TASK_RUNNING, status update state: TASK_RUNNING)

After a few minutes, the idle timeout of the TaskManager kicks in and TaskManager is killed

2019-10-24 06:38:21,649 INFO  ....MesosResourceManager                - Stopping worker taskmanager-00001.
2019-10-24 06:38:21,672 INFO  ....MesosResourceManager                - Closing TaskExecutor connection taskmanager-00001 because: TaskExecutor exceeded the idle timeout.
2019-10-24 06:38:21,760 WARN  ....MesosResourceManager                - Discard registration from TaskExecutor taskmanager-00001 at (akka.tcp://flink@192.168.10.14:31001/user/taskmanager_0) because the framework did not recognize it
2019-10-24 06:38:26,674 INFO  ....mesos.scheduler.TaskMonitor         - Re-attempting to kill Mesos task taskmanager-00001.
2019-10-24 06:38:26,891 WARN  akka.remote.ReliableDeliverySupervisor  - Association with remote system [akka.tcp://flink@192.168.10.14:31001] has failed, address is now gated for [50] ms. Reason: [Disassociated] 
2019-10-24 06:38:27,318 INFO  ....mesos.scheduler.TaskMonitor         - Mesos task taskmanager-00001 exited as planned.
2019-10-24 06:38:27,326 INFO  ....MesosResourceManager                - Worker taskmanager-00001 finished successfully with message: Container exited with status 137

Corresponding logs in mesos:

I1024 06:38:26.678206    20 master.cpp:6261] Processing KILL call for task 'taskmanager-00001' of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:38:26.678450    20 master.cpp:6339] Telling agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14) to kill task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:38:27.304287    20 master.cpp:8703] Status update TASK_KILLED (Status UUID: 94441e8d-02eb-4252-9014-abea6b372acc) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 from agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:38:27.304426    20 master.cpp:8760] Forwarding status update TASK_KILLED (Status UUID: 94441e8d-02eb-4252-9014-abea6b372acc) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
(...)
I1024 06:38:27.319533    19 master.cpp:11770] Removing task taskmanager-00001 with resources cpus(allocated: *):2; mem(allocated: *):1024; ports(allocated: *):[31000-31001] of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)

Second job submission in App Master (the one that stays hangs and is stopped after timeout without any processing)

2019-10-24 06:39:25,220 INFO  ....StandaloneDispatcher    - Received JobGraph submission 4888c165c1a9eae4dd69cd533c66a5c1 (Flink Java Job at Thu Oct 24 06:39:24 UTC 2019).
2019-10-24 06:39:25,220 INFO  ....StandaloneDispatcher    - Submitting job 4888c165c1a9eae4dd69cd533c66a5c1 (Flink Java Job at Thu Oct 24 06:39:24 UTC 2019).
2019-10-24 06:39:25,221 INFO  ....AkkaRpcService          - Starting RPC endpoint for ....JobMaster at akka://flink/user/jobmanager_1 .
2019-10-24 06:39:25,225 INFO  ....JobMaster               - Initializing job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:39:25,228 INFO  ....JobMaster               - Using restart strategy NoRestartStrategy for Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:39:25,230 INFO  ....ExecutionGraph          - Job recovers via failover strategy: New Pipelined Region Failover
2019-10-24 06:39:25,231 INFO  ....JobMaster               - Running initialization on master for job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:39:25,232 INFO  ....JobMaster               - Successfully ran initialization on master in 1 ms.
2019-10-24 06:39:25,234 INFO  ....executiongraph.failover.flip1.RestartPipelinedRegionStrategy  - Start building failover regions.
2019-10-24 06:39:25,234 INFO  ....executiongraph.failover.flip1.RestartPipelinedRegionStrategy  - Created 1 failover regions.
2019-10-24 06:39:25,235 INFO  ....jobmaster.JobManagerRunner           - JobManager runner for job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1) was granted leadership with session id 00000000-0000-0000-0000-000000000000 at akka.tcp://flink@192.168.10.11:6123/user/jobmanager_1.
2019-10-24 06:39:25,236 INFO  ....JobMaster               - Starting execution of job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1) under job master id 00000000000000000000000000000000.
2019-10-24 06:39:25,236 INFO  ....ExecutionGraph          - Job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1) switched from state CREATED to RUNNING.
2019-10-24 06:39:25,236 INFO  ....ExecutionGraph          - DataSource (at $line115.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (....api.java.io.CollectionInputFormat)) (1/1) (d7d19ea639a4adf9246f905f1765b338) switched from CREATED to SCHEDULED.
2019-10-24 06:39:25,237 INFO  ....SlotPoolImpl            - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{ddfc7c3f479469c5e6c72e505299cf1d}]
2019-10-24 06:39:25,238 INFO  ....JobMaster               - Connecting to ResourceManager akka.tcp://flink@192.168.10.11:6123/user/resourcemanager(00000000000000000000000000000000)
2019-10-24 06:39:25,239 INFO  ....JobMaster               - Resolved ResourceManager address, beginning registration
2019-10-24 06:39:25,239 INFO  ....JobMaster               - Registration at ResourceManager attempt 1 (timeout=100ms)
2019-10-24 06:39:25,240 INFO  ....MesosResourceManager  - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_1 for job 4888c165c1a9eae4dd69cd533c66a5c1.
2019-10-24 06:39:25,245 INFO  ....JobMaster               - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2019-10-24 06:39:25,246 INFO  ....SlotPoolImpl            - Requesting new slot [SlotRequestId{ddfc7c3f479469c5e6c72e505299cf1d}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} from resource manager.
2019-10-24 06:39:25,246 INFO  ....MesosResourceManager  - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} for job 4888c165c1a9eae4dd69cd533c66a5c1 with allocation id 04a6f120eed690c9c7cdc8c7c0f67ed7.

At this stage the logs stay so for a few minutes. There is no activity whatsoever in Mesos.

The problem seems to me to be clearly in the Flink App Master / Resource Manager since again noting happens in mesos. There is no activity whatsoever in mesos master logs at that moment (and the Mesos Console shows nothing, no new Task being submitted or executed, both agents are doing nothing).

And after a few minutes a timeout occurs and the job is stopped in FAILED state.

Second job failing after timeout

2019-10-24 06:44:25,245 INFO  ....ExecutionGraph        - DataSource (at $line115.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) org.apache.flink.api.java.io.CollectionInputFormat)) (1/1) (d7d19ea639a4adf9246f905f1765b338) switched from SCHEDULED to FAILED.
.... NoResourceAvailableException: Could not allocate enough slots to run the job. Please make sure that the cluster has enough resources.
    at org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:460)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    (stack traces goes on)
2019-10-24 06:44:25,261 INFO  ....executiongraph...AdaptedRestartPipelinedRegionStrategyNG  - Fail to pass the restart strategy validation in region failover. Fallback to fail global.
(...)
(a few more stack traces)
(...)
2019-10-24 06:44:25,296 INFO  ....ExecutionGraph                   - Discarding the results produced by task execution d7d19ea639a4adf9246f905f1765b338.
2019-10-24 06:44:25,296 INFO  ....SlotPoolImpl                     - Pending slot request [SlotRequestId{ddfc7c3f479469c5e6c72e505299cf1d}] timed out.
2019-10-24 06:44:25,301 INFO  ....dispatcher.StandaloneDispatcher  - Job 4888c165c1a9eae4dd69cd533c66a5c1 reached globally terminal state FAILED.
2019-10-24 06:44:25,340 INFO  ....history.FsJobArchivist           - Job 4888c165c1a9eae4dd69cd533c66a5c1 has been archived at file:/var/lib/flink/completed_jobs/4888c165c1a9eae4dd69cd533c66a5c1.
2019-10-24 06:44:25,341 INFO  ....JobMaster                        - Stopping the JobMaster for job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019(4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:44:25,344 INFO  ....SlotPoolImpl                     - Suspending SlotPool.
2019-10-24 06:44:25,344 INFO  ....JobMaster                        - Close ResourceManager connection 9c3fc959e219c30fba3395bb7be9823c: JobManager is shutting down..
2019-10-24 06:44:25,344 INFO  ....SlotPoolImpl                     - Stopping SlotPool.
2019-10-24 06:44:25,344 INFO  ....jobmaster.JobManagerRunner       - JobManagerRunner already shutdown.
2019-10-24 06:44:25,344 INFO  ....mesos....MesosResourceManager    - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_1 for job 4888c165c1a9eae4dd69cd533c66a5c1 from the resource manager.

For the record, the job I am submitting is a simple Batch Hello World from http://diegoreico.com/environments/runningflinkgclusterwithzeppelin/

UPDATE 2019-10-24 - 1:00 pm

Following log extracts with DEBUG level are available:

Flink App master startup : http://www.niceideas.ch/flink/1_flink_app_master_startup.txt
First job execution (successful) : https://www.niceideas.ch/flink/2_first_job_execution_successful.txt
Task Manager idle timeout kill : https://www.niceideas.ch/flink/3_task_manager_idle_kill.txt
Second job execution (failed) : https://www.niceideas.ch/flink/4_second_job_execution_failed.txt

Update 2019-10-24 - 5:30pm

Logs of Task Manager (Flink worker) are available here : https://www.niceideas.ch/flink/5_task_manger_logs.txt

Could you provide me with the full logs on `DEBUG` log level? — Till Rohrmann, Oct 24 '19 at 10:14
Hey Till, Thank you so much for looking into this. Question above is updated with links to DEBUG level log files. — Jerome Kehrli, Oct 24 '19 at 11:17
I'm pretty sure that you are running into this problem: https://issues.apache.org/jira/browse/FLINK-14074. In a nutshell, the started TaskExecutor is started with more memory than expected. Due to this, the reported slot profile does not match the pending slot profile and the latter remains in the SlotManager. When you submit the second job, the previous TaskExecutor has been terminated but the system still thinks that there will be a slot coming matching exactly the pending slot's profile. If you submit a second job, you will see that the system can actually execute both jobs. — Till Rohrmann, Oct 24 '19 at 14:16
Could you provide me with the complete logs of the `TaskExecutor`. I'd like to see with which memory settings it has been started. — Till Rohrmann, Oct 24 '19 at 14:28
Thx Till. That makes sense indeed. I will be eagerly monitoring FLINK-14074. , test patch as soon as MR is available and post my findings back here after testing it. In the meantime I have increased idle timeout on task manager to avoid the situation (might lead to other troubles but it's fine for my current use cases) and force restart of Job Manager when idle timeout cancels tasks. Let me get back to you with Task Executor DEBUG logs. Thx again. — Jerome Kehrli, Oct 24 '19 at 15:18
Task Manager logs added to question above (not the same execution as the App Master logs but under same conditions) — Jerome Kehrli, Oct 24 '19 at 15:42
Thanks a lot for your help Jerome. We will fix this problem with the next minor release. — Till Rohrmann, Oct 24 '19 at 18:39
As a workaround for the time being, you could set `taskmanager.memory.size` in your configuration used by the master and the `TaskExecutor` processes. — Till Rohrmann, Oct 25 '19 at 16:05
FYI: The fix has been merged into `master` and `release-1.9`. Hence, with the next 1.9 bug fix or the next major release the problem should be gone. — Till Rohrmann, Oct 31 '19 at 13:15

Apache Flink Resource Manager (App Master) fails allocating new Task Managers after the first one is killed after idle timeout

0 Answers0