
H2O Sparkling Water often throws the exception below, and we currently rerun the job manually whenever this happens. The issue is that the Spark job doesn't exit when this exception occurs: it never returns an exit status, so we are not able to automate the restart.

App > Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 316 in stage 22.0 failed 4 times, most recent failure: Lost task 316.3 in stage 22.0 (TID 9470, ip-**-***-***-**.ec2.internal): java.lang.ArrayIndexOutOfBoundsException: 65535
App > at water.DKV.get(DKV.java:202)
App > at water.DKV.get(DKV.java:175)
App > at water.Key.get(Key.java:83)
App > at water.fvec.Frame.createNewChunks(Frame.java:896)
App > at water.fvec.FrameUtils$class.createNewChunks(FrameUtils.scala:43)
App > at water.fvec.FrameUtils$.createNewChunks(FrameUtils.scala:70)
App > at org.apache.spark.h2o.backends.internal.InternalWriteConverterContext.createChunks(InternalWriteConverterContext.scala:28)
App > at org.apache.spark.h2o.converters.SparkDataFrameConverter$class.org$apache$spark$h2o$converters$SparkDataFrameConverter$$perSQLPartition(SparkDataFrameConverter.scala:86)
App > at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:67)
App > at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:67)
App > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
App > at org.apache.spark.scheduler.Task.run(Task.scala:85)
App > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
App > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
App > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
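
One thing we are considering is wrapping the conversion ourselves and forcing the exit status. A stripped-down, untested sketch of what that would look like is below (the `H2OContext.getOrCreate` and `asH2OFrame` calls are the standard Sparkling Water API; the input path and app name are placeholders, and the try/catch with `sys.exit` is only our guess at how to force a nonzero exit status):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.h2o.H2OContext

    object ConvertJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("df-to-h2o").getOrCreate()
        val hc = H2OContext.getOrCreate(spark)

        // Hypothetical input path, stands in for our real data source
        val df = spark.read.parquet("s3://some-bucket/some-path")

        try {
          // This is the call that blows up with ArrayIndexOutOfBoundsException
          val h2oFrame = hc.asH2OFrame(df)
          // ... model training would go here ...
          sys.exit(0)
        } catch {
          case e: Throwable =>
            e.printStackTrace()
            // Force a nonzero exit status so a scheduler can detect the
            // failure and rerun the job automatically
            sys.exit(1)
        }
      }
    }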
  • This usually happens when the H2O cluster fails - do you see any other exceptions in the log? Could you share your use case and environment? – Mateusz Dymczyk Apr 20 '17 at 22:05

1 Answer


This issue is being investigated in the Sparkling Water project's issue tracker.

It seems somehow related to the size of the data.

This happens when we try to pull a huge Spark DataFrame into an H2O Frame: 63M records x 6,300 columns. The H2O/Sparkling Water cluster is sized properly: 40 executors with 17 GB of memory each, and each Spark executor has 4 threads/cores, so the total amount of memory is 680 GB.

We never get this error on smaller datasets.
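
One workaround we are experimenting with (our own guess, not a confirmed fix) is to coalesce the DataFrame into fewer, larger partitions before the conversion, so that fewer chunk-creation tasks run against the H2O cluster. A minimal sketch, where the partition count is an arbitrary example value:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.h2o.H2OContext

    // Hypothetical mitigation: fewer, larger partitions mean fewer
    // createNewChunks calls during the Spark -> H2O conversion.
    // 2000 is an arbitrary example value, not a tuned number.
    def toH2OFrameCoalesced(hc: H2OContext, df: DataFrame) = {
      val coalesced = df.coalesce(2000)
      hc.asH2OFrame(coalesced)
    }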

– Tagar