
I'm trying to pull a 126 GB table out of HAWQ (which is PostgreSQL-based, version 8.2 in this case) into Spark, and it is not working. I can pull smaller tables with no problem. For this one I keep getting the error:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): ExecutorLostFailure (executor driver lost)
    Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
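For context, a minimal sketch of the kind of partitioned JDBC read involved here; the URL, credentials, table name, and partition column below are all placeholders. An unpartitioned JDBC read pulls the whole table through a single task, which can overwhelm one executor on a table this size:

    import java.util.Properties
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("hawq-pull"))
    val sqlContext = new SQLContext(sc)

    val props = new Properties()
    props.setProperty("user", "gpadmin")             // assumed user
    props.setProperty("driver", "org.postgresql.Driver")

    // Splitting the read over many partitions keeps any single task from
    // streaming all 126 GB through one JDBC connection. "id" stands in for
    // a roughly uniformly distributed numeric column; the bounds would come
    // from SELECT min(id), max(id) on the table.
    val df = sqlContext.read.jdbc(
      "jdbc:postgresql://hawq-master:5432/mydb",     // placeholder URL
      "big_table",                                   // placeholder table name
      "id",                                          // placeholder partition column
      0L,                                            // assumed lower bound
      100000000L,                                    // assumed upper bound
      200,                                           // number of partitions
      props)

    println(df.count())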

My cluster specifications are as follows: 64 cores, 512 GB of RAM, 2 nodes.
This is a Spark standalone cluster across the 2 nodes (trust me, I'd like more nodes, but that's all I get). One node is a pure slave, and the other houses both the master and the second slave.

I've tried many memory-allocation configurations with the spark-submit job. I'll list a few here, none of which worked:

    // CONFIG_5: FAIL (96 GB driver, 144 GB executor)
    --driver-memory 96g --executor-memory 6g --num-executors 24 --executor-cores 24

    // CONFIG_4: FAIL (48 GB driver, 192 GB executor)
    --driver-memory 48g --executor-memory 8g --num-executors 24 --executor-cores 24

    // CONFIG_3: FAIL (120 GB driver, 128 GB executor)
    --driver-memory 120g --executor-memory 4g --num-executors 32 --executor-cores 32

    // CONFIG_2: FAIL (156 GB driver, 96 GB executor)
    --driver-memory 156g --executor-memory 4g --num-executors 24 --executor-cores 24

    // CONFIG_1: FAIL (224 GB driver, 1 GB executor)
    --driver-memory 224g --executor-memory 1g --num-executors 1 --executor-cores 48

The error is the same each time: ExecutorLostFailure (executor driver lost).
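Each flag set above would sit inside a spark-submit invocation shaped roughly like the following; the master URL, class, and jar are placeholders, not the actual ones from the job:

    // sketch of the submit command shape for this standalone cluster
    // (note: --num-executors is a YARN flag; standalone mode caps cores
    // via spark.cores.max / --total-executor-cores instead)
    spark-submit --master spark://master-host:7077 \
      --driver-memory 96g --executor-memory 6g \
      --num-executors 24 --executor-cores 24 \
      --class com.example.PullTable /path/to/app.jar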

  • One thing you can try is increasing the `spark.akka.frameSize` configuration parameter, since it controls the cap on message sizes between the executor and the driver. – Rohan Aletty Oct 08 '15 at 01:01
  • Okay, so I would remove the `--num-executors` and `--executor-cores` flags; the default number of executor cores is _"all the available cores on the worker in standalone mode"_ (taken from the documentation). Then I would lower the driver memory (to, say, 2G), since the driver usually needs only a little memory, and heavily increase the executor memory to 0.75 * available_memory_on_each_node. – Glennie Helles Sindholt Oct 08 '15 at 13:49
  • I tried what you suggested, @GlennieHellesSindholt (--driver-memory 24g --executor-memory 192g), but no luck. I definitely need the driver memory over 10g; that was a lesson learned from other jobs that failed/succeeded. – WaveRider Oct 09 '15 at 17:09
  • Also, @Rohan, I set spark.akka.frameSize to 2047, which is the maximum allowed, with no success. – WaveRider Oct 09 '15 at 17:12
  • First, add the code you're using to read the table. Second, instead of reading the whole table, try to read a single file of it (created by a single HAWQ segment) and see whether that works. Then get the logs of the failed executor; the master log has no relevant information. Most likely it is an OOM error, and it is important to see what happened. If it is OOM, set `-XX:+HeapDumpOnOutOfMemoryError` for the executors so you can analyze the dump and see what exactly has taken all the RAM. – 0x0FFF Nov 24 '15 at 10:54
  • Have you solved this? If yes, how? – Matteo Guarnerio Nov 24 '15 at 16:01
  • 1
  • @MatteoGuarnerio, what finally worked was removing the `--num-executors` and `--executor-cores` flags, as @Glennie suggested, and then allocating as much RAM as possible to the driver. It was a while ago, but I think there was a join in the Spark program, which reshuffles partitions and is very driver-memory-intensive on large data sets. As always, start off small, with 50% of the data set, then go up. Also, let the job run; sometimes the Spark errors resolve themselves and the job finishes...sometimes. – WaveRider Dec 01 '15 at 01:39
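To summarize the resolution from the comments: in standalone mode, drop `--num-executors` and `--executor-cores` (each worker then offers all of its cores by default) and give the driver as much memory as possible. The final figures aren't stated in the thread, so the values below are illustrative only:

    // illustrative values only -- the thread does not give the final numbers
    --driver-memory 200g --executor-memory 128g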
