
The final stage of the Spark job saves 37 GB of data to a GCS bucket in Avro format. The Spark app runs on Dataproc.

My cluster consists of 15 workers with 4 cores and 15 GB RAM each, and 1 master with 4 cores and 15 GB RAM.

I use the following code:

df.write.option("forceSchema", schema_str) \
            .format("avro") \
            .partitionBy('platform', 'cluster') \
            .save(f"gs://{output_path}")

Final statistics from the executors: [screenshot omitted]

Across Spark's 4 attempts to run one of the failing tasks, these are the errors I get:

1/4. java.lang.StackOverflowError

2/4. Job aborted due to stage failure: Task 29 in stage 13.0 failed 4 times, most recent failure: Lost task 29.3 in stage 13.0 (TID 3048, ce-w1.internal, executor 17): ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container from a bad node: container_1607696154227_0002_01_000028 on host: ce-w1.internal. Exit status: 50. Diagnostics: [2020-12-11 15:46:19.880]Exception from container-launch.
Container id: container_1607696154227_0002_01_000028
Exit code: 50

[2020-12-11 15:46:19.881]Container exited with a non-zero exit code 50. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)

3/4. java.lang.StackOverflowError
4/4. Identical to 2/4: Job aborted due to stage failure, ExecutorLostFailure (exit status 50) with the same container prelaunch error and stack trace.

The Spark UI gives me this: [screenshot omitted]

From the UI it's apparent that something is off with the data distribution, but repartitioning gives the same StackOverflowError.

So the two questions I want to ask are:

  1. How do I interpret the 'container prelaunch error' message in the context of the StackOverflowError?

  2. Why do the other actions in the job run fine, despite the same data distribution?

mktplus

1 Answer


The problem is not due to your cluster capacity; it is due to the fact that you are working with the Avro format and forcing Spark to write a new schema while saving. Try not to use the post-defined schema and it will work. If you want to change the schema, do it before saving, for example via withColumn, as sketched below. Please also check the number of shuffle partitions.

df.write.format("avro") \
            .partitionBy('platform', 'cluster') \
            .save(f"gs://{output_path}")
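
For illustration, a minimal sketch of adjusting the schema with withColumn before the write instead of forcing it at save time. The column name event_ts, the cast target, and the rename are assumptions for the example, not taken from the original question:

from pyspark.sql import functions as F

# Hypothetical columns: cast/rename before writing so the DataFrame
# already carries the desired schema, instead of passing forceSchema.
df_prepared = df.withColumn("event_ts", F.col("event_ts").cast("long")) \
                .withColumnRenamed("plat", "platform")

df_prepared.write.format("avro") \
            .partitionBy("platform", "cluster") \
            .save(f"gs://{output_path}")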
itIsNaz
  • The same picture repeats: the save stage runs for 6 s and then fails with StackOverflow + container prelaunch errors – mktplus Dec 13 '20 at 20:54
  • The StackOverflowError is generally caused by recursion, please check this link. For your cluster, I think the number of workers and their specs are enough. Check whether you have any recursion. Otherwise, if you can share your script, that would be perfect, because Spark evaluates your code lazily. https://stackoverflow.com/questions/3197708/what-causes-a-java-lang-stackoverflowerror – itIsNaz Dec 13 '20 at 22:10
  • I set the parameter spark.shuffle.partitions to N_CORES * 3 and the error is gone, although the save stage is devilishly slow (~80% of the job, 50 min for 30 GB of input). When set to N_CORES * 4, spark.shuffle.partitions leads to StackOverflow and "container from a bad node" errors – mktplus Dec 14 '20 at 11:53
  • Perfect, can you share the script so the community can better help you? I think you still need to work on optimizing the Spark code, e.g. avoiding aggregations, group-bys, etc. as much as possible – itIsNaz Dec 14 '20 at 20:20
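
As a follow-up to the comment thread above, a minimal sketch of the shuffle-partition tuning that made the error go away, assuming the setting meant is Spark SQL's spark.sql.shuffle.partitions and that N_CORES is the 60 total worker cores (15 workers × 4 cores):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumption: 15 workers x 4 cores = 60 cores in total.
N_CORES = 60

# Tune the number of partitions used for DataFrame shuffles before the write
# (the DataFrame-level setting is spark.sql.shuffle.partitions).
spark.conf.set("spark.sql.shuffle.partitions", str(N_CORES * 3))

df.write.format("avro") \
            .partitionBy("platform", "cluster") \
            .save(f"gs://{output_path}")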