I am on Databricks with Spark 2.2.1 and Scala 2.11. I am attempting to run a SQL query that looks like the following.
select stddev(col1), stddev(col2), ..., stddev(col1300)
from mydb.mytable
I then execute the code as follows.
myRdd = sqlContext.sql(sql)
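For reference, the query string is too wide to write by hand, so it is assembled programmatically; a rough sketch of how it could be built (the column names col1 ... col1300 are an assumption about the table's schema) is:
# Rough sketch of assembling the wide stddev query; assumes the
# numeric columns are named col1 ... col1300.
cols = ["col{}".format(i) for i in range(1, 1301)]
sql = "select {} from mydb.mytable".format(
    ", ".join("stddev({})".format(c) for c in cols))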
However, I see the following exception thrown.
Job aborted due to stage failure: Task 24 in stage 16.0 failed 4 times, most recent failure: Lost task 24.3 in stage 16.0 (TID 1946, 10.184.163.105, executor 3): org.codehaus.janino.JaninoRuntimeException: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection has grown past JVM limit of 0xFFFF
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificMutableProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private InternalRow mutableRow;
/* 009 */   private boolean evalExprIsNull;
/* 010 */   private boolean evalExprValue;
/* 011 */   private boolean evalExpr1IsNull;
/* 012 */   private boolean evalExpr1Value;
/* 013 */   private boolean evalExpr2IsNull;
/* 014 */   private boolean evalExpr2Value;
/* 015 */   private boolean evalExpr3IsNull;
/* 016 */   private boolean evalExpr3Value;
/* 017 */   private boolean evalExpr4IsNull;
/* 018 */   private boolean evalExpr4Value;
/* 019 */   private boolean evalExpr5IsNull;
/* 020 */   private boolean evalExpr5Value;
/* 021 */   private boolean evalExpr6IsNull;
The stack trace goes on and on, and the Databricks notebook even crashes from the sheer volume of output. Has anyone seen this before?
Also, the following two SQL statements, which compute the average and approximate median, run without any problems.
select avg(col1), ..., avg(col1300) from mydb.mytable
select percentile_approx(col1, 0.5), ..., percentile_approx(col1300, 0.5) from mydb.mytable
The problem seems to be with stddev, but the exception is not helpful. Any ideas on what's going on? Is there another way to compute the standard deviation easily that won't lead to this problem?
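(For what it's worth, since avg works, one alternative I have been wondering about is deriving the standard deviation from plain averages via sqrt(avg(x*x) - avg(x)^2), roughly as sketched below. Note that this is the population formula, whereas stddev is the sample version, and I am not sure it even avoids the limit, since it generates more expressions per column.)
# Hypothetical workaround, not verified: compute a population standard
# deviation from avg aggregates, which already compile fine on this table.
cols = ["col{}".format(i) for i in range(1, 1301)]
exprs = ["sqrt(avg({c} * {c}) - avg({c}) * avg({c})) as sd_{c}".format(c=c)
         for c in cols]
df = sqlContext.sql("select {} from mydb.mytable".format(", ".join(exprs)))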
It turns out this post is describing the same problem: Spark supposedly cannot handle wide schemas or a large number of columns because generated classes are limited to 64KB. However, if that's the case, then why do avg and percentile_approx work?
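(If this really is a generated-class size limit, I suppose I could fall back to aggregating in smaller column batches and stitching the results together, roughly as sketched below with an arbitrary batch size of 100, but that feels like a workaround rather than an explanation.)
# Hypothetical fallback: run the stddev aggregation in smaller column
# batches so each query generates less code, then merge the results.
cols = ["col{}".format(i) for i in range(1, 1301)]
results = {}
for start in range(0, len(cols), 100):
    batch = cols[start:start + 100]
    sql = "select {} from mydb.mytable".format(
        ", ".join("stddev({c}) as sd_{c}".format(c=c) for c in batch))
    results.update(sqlContext.sql(sql).collect()[0].asDict())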