I have searched a lot for a succinct answer; hopefully someone can help me with some clarity on Databricks partitioning.
Assume I have a DataFrame with the columns: Year, Month, Day, SalesAmount, StoreNumber.
I want to store this partitioned by Year and Month, so I can run the following command:
df.write.partitionBy('Year', 'Month').format('csv').save('/mnt/path/', header='true')
This will output data in the format: /mnt/path/Year=2019/Month=05/<file-0000x>.csv
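To make the layout concrete, here is a minimal plain-Python sketch (not Spark itself; the function names are illustrative) of how Hive-style partitioning encodes partition column values into directory names, and how they can be recovered from a path:

```python
# Illustrative sketch of Hive-style partition paths (plain Python, not Spark).

def partition_path(base, partitions):
    """Build a path like base/Year=2019/Month=05 from partition column -> value."""
    parts = [f"{col}={val}" for col, val in partitions.items()]
    return "/".join([base.rstrip("/")] + parts)

def parse_partition_path(path, base):
    """Recover partition column values from a Hive-style path."""
    rel = path[len(base.rstrip("/")) + 1:]
    out = {}
    for seg in rel.split("/"):
        if "=" in seg:
            col, val = seg.split("=", 1)
            out[col] = val
    return out

p = partition_path("/mnt/path", {"Year": 2019, "Month": "05"})
```

This is also why the partition columns do not need to be stored inside the CSV files themselves: their values live in the directory names, and Spark adds them back when it reads the data.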
If I then load it back again, such as:
spark.read.format('csv').options(header='true').load('/mnt/path/').createOrReplaceTempView("temp1")
Q1: This has not actually 'read' the data yet, right? I.e. I could have billions of records, but until I actually query temp1, nothing is executed against the source?
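(For my own intuition I think of it like a Python generator pipeline: defining the pipeline touches no data, and the source is only consumed when something forces evaluation. This is a rough analogy only, not how Spark's planner actually works:)

```python
# Rough analogy for lazy evaluation using generators
# (illustrative only; Spark's DataFrame planner works differently).

reads = []

def read_rows():
    """Pretend data source: appends to `reads` when a row is actually read."""
    for i in range(3):
        reads.append(i)                       # side effect marks a real read
        yield {"Year": 2019, "SalesAmount": i * 100}

# Building the pipeline (like df.filter / createOrReplaceTempView):
pipeline = (row for row in read_rows() if row["SalesAmount"] > 0)

assert reads == []                            # nothing read yet

result = list(pipeline)                       # like running the query
assert reads == [0, 1, 2]                     # only now is the source consumed
assert len(result) == 2
```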
Q2-A: Subsequently, when querying this data via temp1, my assumption is that if the WHERE clause includes the columns used for partitioning, smart filtering will be applied so that only the relevant files are read off disk?
%sql
select * from temp1 where Year = 2019 and Month = 05 -- OPTIMAL
whereas the following would not do any file filtering, as the reader has no information about which partitions to look in:
%sql
select * from temp1 where StoreNumber = 152 and SalesAmount > 10000 -- SUB-OPTIMAL
Q2-B: Finally, if I stored the files in Parquet format rather than CSV, would both of the queries above 'push down' into the actual stored data, but perhaps in different ways? I.e. the first would still use partition pruning, but the second (where StoreNumber = 152 and SalesAmount > 10000) would now benefit from Parquet's columnar storage, while CSV does not have that optimisation?
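My understanding is that Parquet files carry per-row-group min/max statistics in their footer, so a predicate like SalesAmount > 10000 can skip whole row groups without decoding them. A toy simulation of that idea (the numbers and structure here are invented for illustration; a real reader gets the stats from the Parquet footer):

```python
# Toy simulation of Parquet row-group skipping via min/max statistics
# (illustrative; real readers obtain these stats from the file footer).

row_groups = [
    {"min_sales": 100,   "max_sales": 9000,  "rows": 1_000_000},
    {"min_sales": 500,   "max_sales": 50000, "rows": 1_000_000},
    {"min_sales": 12000, "max_sales": 80000, "rows": 1_000_000},
]

def groups_to_scan(groups, threshold):
    """For SalesAmount > threshold, skip any group whose max is too low."""
    return [g for g in groups if g["max_sales"] > threshold]

scanned = groups_to_scan(row_groups, 10000)
# The first group is skipped without decoding any of its rows.
# A CSV reader has no such statistics: every byte must be parsed.
```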
Can anyone please clarify my thinking/understanding around this? Links to resources would be great as well.