I am on Databricks with Spark 2.2.1 and Scala 2.11. I am attempting to run a SQL query that looks like the following.
select stddev(col1), stddev(col2), ..., stddev(col1300)
from mydb.mytable
I then execute the code as follows.
myRdd = sqlContext.sql(sql)
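For reference, the query string is too wide to write by hand, so it is assembled programmatically; a rough sketch of how it could be built (the column names col1 ... col1300 are an assumption about the table's schema) is:
# Rough sketch of assembling the wide stddev query; assumes the
# numeric columns are named col1 ... col1300.
cols = ["col{}".format(i) for i in range(1, 1301)]
sql = "select {} from mydb.mytable".format(
    ", ".join("stddev({})".format(c) for c in cols))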
However, I see the following exception thrown.
Job aborted due to stage failure: Task 24 in stage 16.0 failed 4 times, most recent failure: Lost task 24.3 in stage 16.0 (TID 1946, 10.184.163.105, executor 3): org.codehaus.janino.JaninoRuntimeException: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection has grown past JVM limit of 0xFFFF
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificMutableProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private InternalRow mutableRow;
/* 009 */   private boolean evalExprIsNull;
/* 010 */   private boolean evalExprValue;
/* 011 */   private boolean evalExpr1IsNull;
/* 012 */   private boolean evalExpr1Value;
/* 013 */   private boolean evalExpr2IsNull;
/* 014 */   private boolean evalExpr2Value;
/* 015 */   private boolean evalExpr3IsNull;
/* 016 */   private boolean evalExpr3Value;
/* 017 */   private boolean evalExpr4IsNull;
/* 018 */   private boolean evalExpr4Value;
/* 019 */   private boolean evalExpr5IsNull;
/* 020 */   private boolean evalExpr5Value;
/* 021 */   private boolean evalExpr6IsNull;
The stack trace goes on and on, and the Databricks notebook even crashes from the sheer volume of output. Has anyone seen this before?
Also, the following two SQL statements, which compute the average and approximate median, run without any problems.
select avg(col1), ..., avg(col1300) from mydb.mytable
select percentile_approx(col1, 0.5), ..., percentile_approx(col1300, 0.5) from mydb.mytable
The problem seems to be with stddev, but the exception is not helpful. Any ideas on what's going on? Is there another way to compute the standard deviation easily that won't lead to this problem?
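(For what it's worth, since avg works, one alternative I have been wondering about is deriving the standard deviation from plain averages via sqrt(avg(x*x) - avg(x)^2), roughly as sketched below. Note that this is the population formula, whereas stddev is the sample version, and I am not sure it even avoids the limit, since it generates more expressions per column.)
# Hypothetical workaround, not verified: compute a population standard
# deviation from avg aggregates, which already compile fine on this table.
cols = ["col{}".format(i) for i in range(1, 1301)]
exprs = ["sqrt(avg({c} * {c}) - avg({c}) * avg({c})) as sd_{c}".format(c=c)
         for c in cols]
df = sqlContext.sql("select {} from mydb.mytable".format(", ".join(exprs)))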
It turns out this post is describing the same problem: Spark supposedly cannot handle wide schemas or a large number of columns because generated classes are limited to 64KB. However, if that's the case, then why do avg and percentile_approx work?
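(If this really is a generated-class size limit, I suppose I could fall back to aggregating in smaller column batches and stitching the results together, roughly as sketched below with an arbitrary batch size of 100, but that feels like a workaround rather than an explanation.)
# Hypothetical fallback: run the stddev aggregation in smaller column
# batches so each query generates less code, then merge the results.
cols = ["col{}".format(i) for i in range(1, 1301)]
results = {}
for start in range(0, len(cols), 100):
    batch = cols[start:start + 100]
    sql = "select {} from mydb.mytable".format(
        ", ".join("stddev({c}) as sd_{c}".format(c=c) for c in batch))
    results.update(sqlContext.sql(sql).collect()[0].asDict())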