I am trying to create a Spark DataFrame with multiple additional columns based on conditions, like this:
df
.withColumn("name1", someCondition1)
.withColumn("name2", someCondition2)
.withColumn("name3", someCondition3)
.withColumn("name4", someCondition4)
.withColumn("name5", someCondition5)
.withColumn("name6", someCondition6)
.withColumn("name7", someCondition7)
When more than six .withColumn clauses are added, I am faced with the following exception:
org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB
This problem has been reported elsewhere as well, e.g.:
- Spark ML Pipeline Causes java.lang.Exception: failed to compile ... Code ... grows beyond 64 KB
- https://github.com/rstudio/sparklyr/issues/264
Is there a property in Spark where I can configure this size?
Edit:
If even more columns are created, e.g. around 20, I no longer receive the aforementioned exception; instead, after about 5 minutes of waiting, I get the following error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
What I want to perform is spelling/error correction. Some simple cases could be handled easily via a map and replacement in a UDF, as sketched below. Still, several other cases with multiple chained conditions remain.
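A minimal sketch of the UDF-based map-and-replace approach for the simple cases, assuming a made-up corrections dictionary and column name:

import org.apache.spark.sql.functions.{col, udf}

// Hypothetical dictionary of known misspellings -> corrected values.
val corrections = Map("viena" -> "Vienna", "wien" -> "Vienna")

// Null-safe lookup: unknown or null values pass through unchanged.
val fixSpelling = udf((s: String) => if (s == null) s else corrections.getOrElse(s, s))

val corrected = df.withColumn("city", fixSpelling(col("city")))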
I will also follow up on this JIRA issue: https://issues.apache.org/jira/browse/SPARK-18532
A minimal reproducible example can be found here: https://gist.github.com/geoHeil/86e5401fc57351c70fd49047c88cea05