My Spark program is failing, and neither the scheduler, the driver, nor the executors log any useful error apart from Exit status 137. What could be causing Spark to fail?
The crash seems to happen during the conversion of an RDD to a DataFrame:
val df = sqlc.createDataFrame(processedData, schema).persist()
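For anyone trying to reproduce this, the conversion step has roughly the shape below. This is a minimal sketch, not the actual job: the two-column schema, the sample rows, the object name, and local mode are placeholders I made up for illustration. Only the names sqlc, processedData, schema and the createDataFrame(...).persist() call come from the real code, and the sketch assumes processedData is an RDD[Row].

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object CreateDataFrameSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode, only so the sketch runs on its own.
    val spark = SparkSession.builder()
      .appName("create-dataframe-sketch")
      .master("local[*]")
      .getOrCreate()
    val sqlc = spark.sqlContext

    // Placeholder schema; the real one is not shown in this post.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("value", StringType, nullable = true)
    ))

    // Placeholder rows standing in for the real processedData.
    val processedData: RDD[Row] =
      spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b")))

    // Same call as in the failing job.
    val df = sqlc.createDataFrame(processedData, schema).persist()

    // persist() is lazy, so nothing is cached until an action runs.
    df.count()

    spark.stop()
  }
}
```

Since persist() is lazy, the line where the DataFrame is created is not necessarily where the work (and the memory use) actually happens; that comes with the first action on df.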
Right before the crash, the logs look like this:
Scheduler
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 WARN TaskSetManager: Stage 11 contains a task of very large size (22028 KB). The maximum recommended task size is 100 KB.
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 0.0 in stage 11.0 (TID 23, 10.141.1.247, executor 1133b735-967d-136c-2bbf-ffcb3884c88c-1548129213980, partition 0, PROCESS_LOCAL, 22557269 bytes)
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 1.0 in stage 11.0 (TID 24, 10.141.3.144, executor a92ceb18-b46a-c986-4672-cab9086c54c2-1548129202094, partition 1, PROCESS_LOCAL, 22558910 bytes)
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 2.0 in stage 11.0 (TID 25, 10.141.1.56, executor b9167d92-bed2-fe21-46fd-08f2c6fd1998-1548129206680, partition 2, PROCESS_LOCAL, 22558910 bytes)
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 3.0 in stage 11.0 (TID 26, 10.141.3.146, executor 0cf7394b-540d-2a6c-258a-e27bbedbdd0e-1548129212488, partition 3, PROCESS_LOCAL, 22558910 bytes)
19/01/22 04:01:09 DEBUG JobUtils: Tracing alloc 12943f1a-82ed-d4f4-07b3-dfbe5a46716b for driver
...
19/01/22 04:13:45 DEBUG JobUtils: Tracing alloc 12943f1a-82ed-d4f4-07b3-dfbe5a46716b for driver
19/01/22 04:13:46 INFO JobUtils: driver Terminated -- Exit status 137
19/01/22 04:13:46 INFO JobUtils: driver Restarting -- Restart within policy
Driver
19/01/22 04:01:12 INFO DAGScheduler: Job 7 finished: runJob at SparkHadoopMapReduceWriter.scala:88, took 8.008375 s
19/01/22 04:01:12 INFO SparkHadoopMapReduceWriter: Job job_20190122040104_0032 committed.
19/01/22 04:01:13 INFO MapPartitionsRDD: Removing RDD 28 from persistence list
19/01/22 04:01:13 INFO BlockManager: Removing RDD 28
Executors (some variation of this on each)
19/01/22 04:01:13 INFO BlockManager: Removing RDD 28
19/01/22 04:13:45 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver 10.141.2.48:21297 disassociated! Shutting down.
19/01/22 04:13:45 INFO DiskBlockManager: Shutdown hook called
19/01/22 04:13:45 INFO ShutdownHookManager: Shutdown hook called
19/01/22 04:13:45 INFO ShutdownHookManager: Deleting directory /alloc/spark-ce736cb6-8b8e-4891-b9c7-06ea9d9cf797