I am trying to read a Hive table in Spark as a strongly typed Dataset, and I am noticing that the partitions are not being pruned, whereas the equivalent query on a DataFrame over the same Hive table prunes them correctly.
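
For context, the table is partitioned by country. I don't have the exact DDL in front of me, but a minimal reproduction of my setup would look roughly like this (the storage format is an assumption on my part, and it needs a Hive-enabled session):

// hypothetical DDL, just to make the example reproducible
spark.sql("""
  CREATE TABLE IF NOT EXISTS db1.states (state STRING)
  PARTITIONED BY (country STRING)
  STORED AS PARQUET
""")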
import spark.implicits._  // needed for .as[States] and the $-column syntax below

case class States(state: String, country: String)
val hiveDS = spark.table("db1.states").as[States]
// no partition pruning
hiveDS.groupByKey(x => x.country).count().filter(x => x._1 == "US")
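
(I'm judging the pruning behaviour from the physical plan, i.e. by calling explain() on the same pipeline and looking at the scan node; output omitted here:)

hiveDS.groupByKey(x => x.country).count().filter(x => x._1 == "US").explain()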
states is partitioned by country, so I would expect only the US partition to be read, yet the query above scans all the partitions. However, if I read the same table as an untyped DataFrame:
val hiveDF = spark.table("db1.states")
// correct partition pruning
hiveDF.groupBy("country").count().filter($"country" === "US")
The partitions are pruned correctly. Can anyone explain why the partition information is lost when the table is mapped to a case class?
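
In case it helps, the same explain() check on the DataFrame version is what shows the filter on country being pushed down into the scan (again, output omitted; explain() itself is standard Spark API):

hiveDF.groupBy("country").count().filter($"country" === "US").explain()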