If I am understanding the documentation correctly, partitioning a DataFrame and partitioning a Hive or other on-disk table seem to be different things. For on-disk storage, partitioning by, say, date creates one partition per date that occurs in my dataset. This seems useful: if I query records for a given date, the nodes in my cluster read only the partition corresponding to that date.
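Concretely, this is the on-disk behavior I mean (the paths and the date column are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Writing with partitionBy creates one directory per distinct date,
# e.g. /data/events/date=2023-01-01/, /data/events/date=2023-01-02/, ...
df = spark.read.parquet("/data/raw_events")
df.write.partitionBy("date").parquet("/data/events")

# A filter on the partition column is pushed down, so only the matching
# directory is scanned (partition pruning) instead of the whole table.
one_day = spark.read.parquet("/data/events").where("date = '2023-01-01'")
```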
DataFrame.repartition, on the other hand, creates one in-memory partition for each date that occurs in my dataset. If I filter for records from a specific date, they will all be found in a single partition and thus all be processed by a single node.
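Here is the call I'm referring to (continuing with the spark session from the snippet above):

```python
from pyspark.sql import functions as F

# Repartitioning by a column hash-partitions the DataFrame in memory:
# every row with the same date ends up in the same partition.
by_date = spark.read.parquet("/data/events").repartition(F.col("date"))

# As I read it, this filter finds all of its rows in one partition,
# so a single node ends up doing the work.
one_day = by_date.where(F.col("date") == "2023-01-01")
```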
Is this right? If so, what is the use case? And what is the way to get the speed advantage of an on-disk partitioning scheme in the context of a DataFrame?
For what it's worth, I need the advantage after I aggregate the on-disk data, so the on-disk partitioning doesn't necessarily help me, even with lazy evaluation.
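Roughly, my workflow looks like this (the aggregation and column names are invented):

```python
# The aggregation shuffles the data, so whatever layout the files had
# on disk is lost by the time I need fast per-date access.
daily_totals = (
    spark.read.parquet("/data/events")
         .groupBy("date", "user_id")
         .agg(F.sum("amount").alias("total"))
)

# It's at this point, downstream of the shuffle, that I want
# date-based access to be fast.
```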