
When I run this Spark code in Scala:

    df.withColumn(x, when(col(x).isin(values: _*), col(x)).otherwise(lit(null).cast(StringType)))

I get this error:

     java.lang.RuntimeException: Compiling "GeneratedClass": Code of method
 "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
 of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
        at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
        at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)

df: a Spark Dataset

x: the name of a StringType column; each row holds a value like "US,Washington,Seattle"

values: Array[String]

Hossein
  • You may want to check out https://stackoverflow.com/questions/50891509/apache-spark-codegen-stage-grows-beyond-64-kb – Lars Skaug Jul 25 '20 at 21:34
  • This can happen when your code is too long without any actions. You should cache your dataframe at some point. – Lamanus Jul 26 '20 at 14:19

1 Answer


This is a known issue caused by the generated bytecode for the query plan growing past the JVM's 64 KB method-size limit. The common workaround is to add a checkpoint, i.e., save your dataframe and read it back, which truncates the lineage so code generation starts from a fresh plan.

See the following for further detail: Apache Spark Codegen Stage grows beyond 64 KB
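A minimal sketch of the workaround, assuming a local session; the checkpoint directory, column name, and sample values are illustrative, not from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}
import org.apache.spark.sql.types.StringType

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("checkpoint-sketch")
      .master("local[*]")
      .getOrCreate()

    // checkpoint() requires a checkpoint directory (illustrative path).
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    import spark.implicits._
    val x = "location"                                                // hypothetical column name
    val values = Array("US,Washington,Seattle", "US,Oregon,Portland") // hypothetical allow-list
    var df = Seq("US,Washington,Seattle", "FR,IDF,Paris").toDF(x)

    // The expression from the question.
    df = df.withColumn(x,
      when(col(x).isin(values: _*), col(x))
        .otherwise(lit(null).cast(StringType)))

    // Truncate the lineage so codegen does not accumulate across stages.
    df = df.checkpoint()
    // Lighter-weight alternative, as suggested in the comments:
    //   df = df.cache(); df.count()

    df.show()
    spark.stop()
  }
}
```

Where exactly to checkpoint depends on the job; the usual pattern is to do it after the long chain of transformations that triggers the error.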

Lars Skaug
  • I knew about the issue; I was wondering whether there is an alternative to my code (i.e., when/otherwise) that does not cause this error. – Hossein Aug 05 '20 at 21:20
  • Well, as Lamanus pointed out as well, caching your dataframe should prevent the error. – Lars Skaug Aug 05 '20 at 21:36