The final stage of the Spark job is to save 37 GB of data to a GCS bucket in Avro format. The Spark app runs on Dataproc.
My cluster consists of 15 workers (4 cores, 15 GB RAM each) and 1 master (4 cores, 15 GB RAM).
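For context, this is roughly the executor layout I assume that cluster shape gives me. The values below are illustrative assumptions, not the config Dataproc actually generates:

from pyspark.sql import SparkSession

# Illustrative sketch only: assumed sizing for 15 workers with 4 cores / 15 GB each,
# roughly two executors per worker. Not the actual Dataproc-generated settings.
spark = (SparkSession.builder
         .appName("avro-export")                    # hypothetical app name
         .config("spark.executor.instances", "30")  # ~2 executors per worker (assumption)
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "5g")
         .getOrCreate())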
I use the following code:
df.write.option("forceSchema", schema_str) \
    .format("avro") \
    .partitionBy('platform', 'cluster') \
    .save(f"gs://{output_path}")
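For reference, schema_str is the Avro schema (as a JSON string) built earlier in the job. A purely illustrative sketch of its shape, where every field other than the partition columns platform and cluster is hypothetical:

# Illustrative sketch only: the real schema is built earlier in the job.
schema_str = """
{
  "type": "record",
  "name": "OutputRecord",
  "fields": [
    {"name": "platform", "type": "string"},
    {"name": "cluster",  "type": "string"},
    {"name": "payload",  "type": ["null", "string"], "default": null}
  ]
}
"""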
Final statistics from the executors: [screenshot of the Spark UI executors tab]
Across the 4 attempts Spark makes to run one of the failing tasks, the errors I get are:
1/4. java.lang.StackOverflowError
2/4. Job aborted due to stage failure: Task 29 in stage 13.0 failed 4 times, most recent failure: Lost task 29.3 in stage 13.0 (TID 3048, ce-w1.internal, executor 17): ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container from a bad node: container_1607696154227_0002_01_000028 on host: ce-w1.internal. Exit status: 50. Diagnostics: [2020-12-11 15:46:19.880]Exception from container-launch.
Container id: container_1607696154227_0002_01_000028
Exit code: 50
[2020-12-11 15:46:19.881]Container exited with a non-zero exit code 50. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
3/4. java.lang.StackOverflowError
4/4. Job aborted due to stage failure: identical to attempt 2/4 (same ExecutorLostFailure on executor 17, exit status 50, same container diagnostics and ObjectInputStream stack trace).
The Spark UI gives me this: [screenshot of the stage's task metrics]
From the UI it's apparent that the data distribution is skewed, but repartitioning before the write gives the same StackOverflowError.
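Concretely, the repartitioning I tried looks roughly like the sketch below (repartitioning on the same columns used by partitionBy; the partition count of 200 is just an illustrative value):

# Sketch of the repartition attempt; the partition count and column choice
# (the same columns as partitionBy) are illustrative assumptions.
(df.repartition(200, "platform", "cluster")
    .write.option("forceSchema", schema_str)
    .format("avro")
    .partitionBy("platform", "cluster")
    .save(f"gs://{output_path}"))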
So the two questions I want to ask are:
1. How do I decode the 'container prelaunch error' message in the context of a StackOverflowError?
2. Why do the other actions in the job run fine, despite the same data distribution?