
The situation where you want to execute a Spark SQL query on only partition columns comes up pretty often. For example, let's say you want to programmatically get the latest date from a table (where date is a partition column). Conceptually, you can do something like this:

val myData = spark.table("chrisa.my_table")
val latestDate = myData.select($"date").distinct
                     .orderBy($"date".desc)
                     .limit(1).collect

I'm assuming that "date" is a column you can order by; perhaps it's a string formatted like "yyyyMMdd", such as "20220908".

Sometimes I've seen code similar to this be very slow (it's clearly reading the underlying file data), and sometimes I've seen it be nearly instant (it's probably only accessing table metadata about the partition columns, without reading the files).
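One way to tell which case you're in, without timing anything, is to inspect the query plan. This sketch uses the table name from my question; my understanding (which I'm not certain of) is that in Spark 2.4, when the metadata-only optimization applies, the optimized plan contains a LocalRelation/LocalTableScan over the partition values instead of a file scan:

```scala
// Sketch: check the plan rather than the wall-clock time. If the
// metadata-only optimization fired, the physical plan should show a
// LocalTableScan over partition values rather than a FileScan.
val q = spark.table("chrisa.my_table").select($"date").distinct
q.explain(true) // prints parsed, analyzed, optimized, and physical plans
```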

My questions are: under what conditions will this avoid reading the underlying full row data? Specifically,

  • What has to be true about the storage format or metadata?
  • Which transformations will allow this to work without reading the full row data?

I'm using Spark 2.4.2

Chris A.
  • Does this answer your question? [How do I enable partition pruning in spark](https://stackoverflow.com/questions/39898067/how-do-i-enable-partition-pruning-in-spark) – Gabio Sep 08 '22 at 20:03
  • It's somewhat related but definitely not a direct answer. This linked question is about partition pruning, and my question is more about whether Spark can *only* use the partition metadata to return a result in some cases, and if so, when. – Chris A. Sep 08 '22 at 20:12
  • https://stackoverflow.com/questions/55053218/pyspark-getting-latest-partition-from-hive-partitioned-column-logic – Young Sep 09 '22 at 09:05
  • @Young this is more along the lines I had in mind but without using `show partitions` since processing it programmatically seems to depend on the underlying table format and the structure of the `show partitions` output, whereas if you'll notice, my code above in the question doesn't care about these things. – Chris A. Sep 12 '22 at 04:17
  • Use spark.sessionState.catalog.listPartitionNames to get a list of partitions, then take the max. No need to extract information from spark.sql("show partitions xxx"). – Young Sep 12 '22 at 08:29
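To sketch that suggestion (using the table and partition column names from my question; the parsing below assumes a single partition column named "date", and the lexicographic max is only safe for zero-padded strings like "yyyyMMdd"):

```scala
import org.apache.spark.sql.catalyst.TableIdentifier

// List partition names from catalog metadata only -- no file reads.
// For a single partition column the entries look like "date=20220908".
val partitionNames: Seq[String] =
  spark.sessionState.catalog.listPartitionNames(
    TableIdentifier("my_table", Some("chrisa")))

// Strip the "date=" prefix and take the lexicographic max.
val latestDate: Option[String] =
  partitionNames.map(_.stripPrefix("date=")).sorted.lastOption
```

Note the caveat that applies to any metadata-only approach: this returns the latest partition registered in the metastore even if that partition's files contain zero rows.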

1 Answer


After a little more digging I found two JIRAs that answer my question exactly.

It seems that the first JIRA was closed in favor of the second because the second describes the problem better.

The summary from these two is that there used to be a config parameter, spark.sql.optimizer.metadataOnly, that allowed exactly what I wanted, but it can produce incorrect results if the underlying files contain zero rows. So the parameter and the whole concept were deprecated and removed, favoring correctness over speed.
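For reference, in Spark 2.4 (before the removal) the optimization could be toggled per session; disabling it forces a real scan and therefore correct results for zero-row partitions, at the cost of speed:

```scala
// Spark 2.4: turn off the metadata-only optimization to guarantee that
// queries over partition columns read the actual file data.
spark.conf.set("spark.sql.optimizer.metadataOnly", "false")
```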

Even my example in the original post shows a case that could give two different results if the underlying file data had zero rows:

val myData = spark.table("chrisa.my_table")
val latestDate = myData.select($"date").distinct
                     .orderBy($"date".desc)
                     .limit(1).collect

For example, if the latest date for which there was metadata had zero rows, then a full scan would yield the next-latest date (which actually had rows), but the metadata-only version would yield the latest date that had metadata.
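A hypothetical repro of that discrepancy (table name "t" is made up; this requires a Spark session with Hive support, since ALTER TABLE ... ADD PARTITION registers a partition in the metastore without writing any rows):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.max

// One real partition plus one registered-but-empty partition.
spark.sql("CREATE TABLE t (x INT) PARTITIONED BY (date STRING)")
spark.sql("INSERT INTO t PARTITION (date = '20220907') VALUES (1)")
// Partition exists in the metastore but holds zero rows:
spark.sql("ALTER TABLE t ADD PARTITION (date = '20220908')")

// A full scan sees only actual rows, so the max is 20220907.
spark.table("t").agg(max($"date")).show()
// With spark.sql.optimizer.metadataOnly=true, the metadata-only plan
// would answer 20220908 -- the empty partition -- hence the removal.
```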

Chris A.