The situation where you want to execute a Spark SQL query on only partition columns comes up pretty often. For example, let's say you want to programmatically get the latest date from a table, where date is a partition column. Conceptually, you can do something like this:
import spark.implicits._  // needed for the $"date" column syntax

val myData = spark.table("chrisa.my_table")
val latestDate = myData.select($"date").distinct()
  .orderBy($"date".desc)
  .limit(1)
  .collect()
I'm assuming that "date" is a column you can order by; perhaps it's a string formatted as "yyyyMMdd", such as "20220908".
Sometimes I've seen code similar to this be very slow (it's clearly reading the underlying file data), and sometimes I've seen it be nearly instant (probably only accessing the table's metadata about the partition columns, without reading the files).
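One way I've tried to tell the two cases apart is comparing physical plans. This is just a diagnostic sketch reusing myData from above; my (unconfirmed) understanding is that when the optimizer can answer the query from partition metadata alone, no FileScan of the table appears in the physical plan:

// Print the query plans instead of running the query. If the answer
// comes from partition metadata alone, I'd expect no FileScan of
// chrisa.my_table in the physical plan.
myData.select($"date").distinct()
  .orderBy($"date".desc)
  .limit(1)
  .explain(true)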
My questions are: under what conditions will this not have to read the underlying full row data? Specifically:
- What has to be true about the storage format or metadata?
- Which transformations will allow this to work without reading the full row data?
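For comparison, here is a sketch of a workaround that I believe only touches the metastore (assuming a Hive-style table partitioned solely by date, so SHOW PARTITIONS returns one row per partition formatted like date=20220908):

// SHOW PARTITIONS reads the partition list from the metastore,
// so no underlying file data should be scanned.
val latestFromMetastore = spark.sql("SHOW PARTITIONS chrisa.my_table")
  .collect()
  .map(_.getString(0).stripPrefix("date="))
  .max  // lexicographic max works because the dates are yyyyMMdd strings

But I'd still like to understand when the straightforward DataFrame query above can be answered from metadata by itself.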
I'm using Spark 2.4.2.