
I have a table with several hundred fields, which is more than the maximum number of constructor parameters allowed for a case class. This makes it impossible to turn a generic DataFrame with more than 254 fields into a Dataset; the only option is to leave it row-encoded.

Calling `.as[T]`, where `T` is a case class with more than 254 fields, crashes with a JVM exception.

What have people done to work around this? Is the only way to group fields together in a nested fashion? I'd like to avoid that, as it hurts usability: existing code depends on the case class having a flat, non-nested structure.
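For illustration, the kind of nested grouping I'd like to avoid would look roughly like the sketch below. The class and field names are placeholders, and `spark` (a SparkSession) and `df` (the wide DataFrame) are assumed to be in scope:

```scala
import org.apache.spark.sql.functions.struct
import spark.implicits._ // assumes a SparkSession named `spark`

// Hypothetical groups, each kept well under the JVM's 254-parameter limit.
case class Group1(f1: Int, f2: String)        // ...more fields in reality
case class Group2(f251: Double, f252: String) // ...more fields in reality
case class WideRecord(g1: Group1, g2: Group2)

// `df` is the wide, flat DataFrame; pack related columns into structs whose
// names and field names line up with the nested case classes.
val ds = df
  .select(
    struct($"f1", $"f2").as("g1"),
    struct($"f251", $"f252").as("g2"))
  .as[WideRecord]
```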

  • We had a similar issue and attempted to solve it by switching to the RDD API rather than using a DataFrame or Dataset; however, we are still looking for a better solution. – H Roy Dec 20 '18 at 06:43
  • Why not use a dataframe for the data instead of a dataset? – Shaido Dec 20 '18 at 07:13
  • It affects other things that are dependent on there being a defined schema. – adrian Dec 20 '18 at 07:31
  • @adrian: A dataframe does have a defined schema (try `printSchema` on a dataframe); a sketch of checking that schema at runtime is below these comments. The main difference compared to a dataset is that there is no type safety at compile time (see e.g. https://stackoverflow.com/questions/37301226/difference-between-dataset-api-and-dataframe-api). – Shaido Dec 20 '18 at 09:26
  • Considering that this is a hard limit of the JVM, completely unrelated to Spark, and that case class functionality depends on their constructors, there is not much you can do. You might try to design bean-like classes (standard Beans won't work for the same reason) with encoders which depend only on getters and setters, but ultimately this would be yet another Row. – zero323 Dec 20 '18 at 12:27
  • @Shaido compile-time type safety is effectively a schema, isn't it? In general that's important for code quality in my project going forward, so I'm looking for solutions. – adrian Dec 20 '18 at 20:56
  • If the data source is a relational table, then using a Dataset is a futile endeavor, because schema conformance is already taken care of by the database itself; you can use the DataFrame API directly in this case. The limit on case class constructor parameters is not really 254: **case classes in Scala are implemented as Tuples**, and a Tuple cannot have more than 22 elements, so that limit implicitly applies to case classes, and there is no way around it, at least until Dotty (Scala 3) comes out. – Yayati Sule Jun 15 '20 at 19:54
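As a rough sketch of Shaido's point above that a DataFrame still carries a schema: one runtime alternative to a 254+-field case class is to validate the DataFrame against an expected `StructType`. The field names and the `expected` value below are hypothetical.

```scala
import org.apache.spark.sql.types._

// Hypothetical expected schema for the wide table (only two fields shown).
val expected = StructType(Seq(
  StructField("f1", IntegerType),
  StructField("f2", StringType)
  // ... remaining fields
))

df.printSchema() // inspect the schema Spark inferred or read

// Fail fast if the runtime schema drifts from what downstream code expects
// (exact match, including nullability).
require(df.schema == expected,
  s"Unexpected schema:\n${df.schema.treeString}")
```

This gives a runtime check rather than the compile-time safety a Dataset provides, but it does not hit the JVM constructor-parameter limit.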

0 Answers