
I am using Spark to read data from a Hive table, and what I really want is a strongly typed Dataset.

Here's what I am doing, and this works:

val myDF = spark.sql("select col1, col2 from hive_db.hive_table")

// Make sure that the field names in the case class exactly match the Hive column names
case class MyCaseClass(col1: String, col2: String)

import spark.implicits._   // provides the implicit Encoder that .as[...] needs
val myDS = myDF.as[MyCaseClass]

The problem I have is that my Hive table has a large number of columns and many of them are structs, so it's not trivial to define the case class.

Is there a way to create a Dataset without needing to write a case class? Since Hive already has all the column names and data types defined, is there a way to create a Dataset directly from that metadata?
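For example (just to illustrate the point; spark.table reads the same metastore), Spark can already print the full schema, so the type information is clearly there:

spark.table("hive_db.hive_table").printSchema()   // every column name and type, straight from the Hive metastore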


1 Answer


TL;DR The short answer is that there is no such option. A Dataset is defined in terms of the stored type and its Encoder, so you cannot just skip the type.
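To make that concrete (a minimal sketch reusing the names from the question), .as[T] resolves an implicit Encoder[T] at compile time, which requires a concrete type:

import org.apache.spark.sql.Encoder
import spark.implicits._

case class MyCaseClass(col1: String, col2: String)   // the type you'd rather not write by hand

val enc = implicitly[Encoder[MyCaseClass]]   // derived at compile time from the concrete case class
val myDS = myDF.as[MyCaseClass]              // no concrete type means no Encoder, and no Dataset[T]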

In practice there are different options you can explore, including Scala reflection, macros, and code generation, to derive the required types from the table metadata. Some of these have been successfully used in the wild to solve similar problems (you can check macro usage in ScalaRelational or code generation in ScalikeJDBC). As of today there are no built-in tools that play a similar role in Apache Spark.
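As a rough illustration of the code generation route (a sketch only, not a built-in facility; the helper name and the type mapping are my own, and only a handful of primitive types plus nested structs are covered), you could walk the table's StructType and print case class definitions to paste back into your source:

import org.apache.spark.sql.types._
import scala.collection.mutable

// Collects "case class ..." definitions for a struct and any nested structs it contains.
def caseClassesFor(className: String, schema: StructType,
                   out: mutable.LinkedHashMap[String, String]): String = {
  val fields = schema.fields.map { f =>
    val scalaType = f.dataType match {
      case nested: StructType => caseClassesFor(f.name.capitalize, nested, out)
      case StringType         => "String"
      case IntegerType        => "Int"
      case LongType           => "Long"
      case DoubleType         => "Double"
      case BooleanType        => "Boolean"
      case other              => other.simpleString // arrays, maps, decimals, etc. would need extra handling
    }
    s"${f.name}: $scalaType"
  }
  out += (className -> s"case class $className(${fields.mkString(", ")})")
  className
}

val defs = mutable.LinkedHashMap.empty[String, String]
caseClassesFor("MyCaseClass", spark.table("hive_db.hive_table").schema, defs)
defs.values.foreach(println) // paste the printed classes into your code, then use .as[MyCaseClass]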

However, if the schema is quite complex, this might be a dead end for a number of reasons, including but not limited to:

  • Runtime overhead of "typed" transformations.
  • Platform limitations like the limit on the number of arguments of JVM methods (see for example SI-7324) or JVM code size limits.
  • Usability, especially when Scala reflection is used. While code generation can provide a pretty decent user experience, the remaining options are arguably not better than working with a simple named bag of Any's (a.k.a. o.a.s.sql.Row); see the sketch after this list.
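For completeness, the untyped Row-based alternative mentioned above is simply to stay with the DataFrame (a Dataset[Row]) and access fields by name, without any case class (a minimal sketch reusing the table and column names from the question):

val df = spark.table("hive_db.hive_table")   // Dataset[Row]; the schema still comes from the metastore
df.select("col1", "col2").take(5).foreach { row =>
  println(s"${row.getAs[String]("col1")} / ${row.getAs[String]("col2")}")
}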