
I am trying to create a Spark DataFrame with multiple additional columns based on conditions like this:

    df
      .withColumn("name1", someCondition1)
      .withColumn("name2", someCondition2)
      .withColumn("name3", someCondition3)
      .withColumn("name4", someCondition4)
      .withColumn("name5", someCondition5)
      .withColumn("name6", someCondition6)
      .withColumn("name7", someCondition7)

When more than six .withColumn clauses are added, I am faced with the following exception:

org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB

This problem has been reported elsewhere as well, e.g. in the Jira issue linked below.

Is there a property in Spark where I can configure this size?

Edit:

If even more columns are created, e.g. around 20, I no longer receive the aforementioned exception; instead, after 5 minutes of waiting, I get the following error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

What I want to perform is spelling/error correction. Some simple cases can be handled easily via a map and replacement in a UDF. Still, several other cases with multiple chained conditions remain.
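
For illustration, the simple map-and-replace part could look like this (a minimal sketch; the lookup table and column name are made up):

    import org.apache.spark.sql.functions.{col, udf}

    // hypothetical lookup table of known misspellings -> corrections
    val corrections = Map("Viena" -> "Vienna", "Zurch" -> "Zurich")

    // UDF that replaces a value if it is a known misspelling, otherwise keeps it
    val correct = udf((s: String) => corrections.getOrElse(s, s))

    val fixed = df.withColumn("cityFixed", correct(col("city")))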

I will also follow up there: https://issues.apache.org/jira/browse/SPARK-18532

A minimal reproducible example can be found here: https://gist.github.com/geoHeil/86e5401fc57351c70fd49047c88cea05

Georg Heiler

1 Answer


This error is caused by whole-stage code generation (WholeStageCodegen) running into a JVM limitation.

Quick answer: no, you cannot change the limit. Please look at this question; 64 KB is the maximum method size in the JVM.

We must wait for a workaround in Spark; currently there is nothing you can change in the system parameters.
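
A commonly suggested mitigation, which does not lift the JVM limit itself, is to disable whole-stage code generation so Spark falls back to the interpreted evaluation path; this trades the error for slower execution. A minimal sketch, assuming Spark 2.x and a session named spark:

    // disable whole-stage codegen for this session (Spark 2.x);
    // avoids generating one huge method, at the cost of performance
    spark.conf.set("spark.sql.codegen.wholeStage", false)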

T. Gawęda