
I am running this code on EMR 4.6.0 + Spark 1.6.1:

import org.apache.spark.sql.SQLContext

val sqlContext = SQLContext.getOrCreate(sc)
// read.json returns a DataFrame, despite the variable name
val inputRDD = sqlContext.read.json(input)

try {
    // keep rows where at least one of the two fields is present, then write back out as JSON text
    inputRDD.filter("`first_field` is not null OR `second_field` is not null")
      .toJSON
      .coalesce(10)
      .saveAsTextFile(output)
    logger.info("DONE!")
} catch {
    case e: Throwable => logger.error("ERROR: " + e.getMessage)
}

In the last stage of saveAsTextFile, it fails with this error:

16/07/15 08:27:45 ERROR codegen.GenerateUnsafeProjection: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown past JVM limit of 0xFFFF
/* 001 */
/* 002 */ public java.lang.Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] exprs) {
/* 003 */   return new SpecificUnsafeProjection(exprs);
/* 004 */ }
(...)

What could be the reason? Thanks

Nhan Trinh
  • Interesting. Each Java class has a constant pool that holds every constant, including things like method names. The pool size is stored as `u2 constant_pool_count`, so the maximum number of constants is 0xFFFF. I tested with a simple JSON and it does not throw an exception. Why does this code generate so many constants? Would it be possible to post part of the JSON data? – Rockie Yang Jul 15 '16 at 10:45
  • @RockieYang it's not possible to upload my JSON, but it consists of about 90 String/Number fields. – Nhan Trinh Jul 15 '16 at 10:49
  • You just need to add one JSON record. And have you tested whether it's related to the number of rows? – Rockie Yang Jul 15 '16 at 10:52
  • @RockieYang I've just solved this problem; you can check the answer below. About your question, the job runs over about 400 GB of data. – Nhan Trinh Jul 15 '16 at 11:07

3 Answers


Solved this problem by dropping all the unused columns in the DataFrame, i.e. filtering down to only the columns you actually need.

Turns out Spark DataFrames cannot handle super-wide schemas. There is no specific number of columns at which Spark breaks with "Constant pool has grown past JVM limit of 0xFFFF" - it depends on the kind of query - but reducing the number of columns helps to work around this issue.
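For reference, a minimal sketch of that column-pruning workaround, assuming a hypothetical list of needed columns (third_field is illustrative, not from the original job): narrow the schema right after reading, so code generation only has to cover the few fields you keep.

import org.apache.spark.sql.SQLContext

val sqlContext = SQLContext.getOrCreate(sc)

// Hypothetical list of the only columns the downstream query actually uses
val neededColumns = Seq("first_field", "second_field", "third_field")

// Prune the schema immediately after reading, before any other transformation
val narrowed = sqlContext.read.json(input)
  .select(neededColumns.head, neededColumns.tail: _*)

narrowed
  .filter("`first_field` is not null OR `second_field` is not null")
  .toJSON
  .coalesce(10)
  .saveAsTextFile(output)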

The underlying root cause is the JVM's 64K limit on generated Java classes (the error comes from the class constant pool, which can hold at most 0xFFFF entries) - see also Andrew's answer below.

Tagar
Nhan Trinh
  • I just did a test: Spark works with 100 columns if they are simple columns. With complex columns containing many embedded structures, it will eventually fail once the total number of fields exceeds a certain number. How many fields are there in total in your JSON? – Rockie Yang Jul 15 '16 at 11:30
  • About 90 in general, and some of the fields are nested, so overall I would say over 100. – Nhan Trinh Jul 15 '16 at 12:31
  • It must be something special in your JSON structure. I tested with 4000 simple columns and it works. Anyway, it's good that you solved it. – Rockie Yang Jul 15 '16 at 12:49
  • Is there still no way around this? – Boern Jan 17 '17 at 12:02
  • It depends on your workflow/query/transformations etc. Your estimate of 100 columns is generally not correct. We have some workflows with 11,000 columns that work fine, while some other workflows run into this problem at 3,000 columns. Edited answer. – Tagar Dec 22 '17 at 19:06

This is due to a known limitation of Java that prevents generated classes from growing beyond 64Kb.

This limitation has been worked around in SPARK-18016, which is fixed in Spark 2.3, to be released in Jan 2018.

Tagar
Andrew

For future reference, this issue was fixed in Spark 2.3 (as Andrew noted).

If you encounter this issue on Amazon EMR, upgrade to release version 5.13 or above.

Uri Goren
  • And if you use AWS Glue with Spark version 2.2.x, good luck finding out when a newer version will be released. – halil Jan 30 '19 at 12:53