Spark needs to load the partition metdata first in the driver to know whether the partition exists or not. Spark will query the directory to find existing partitions to know if it can prune the partition or not during the scanning of the data.
I've tested this on Spark 2.0 and you can see in the log messages.
16/10/14 17:23:37 TRACE ListingFileCatalog: Listing s3a://mybucket/reddit_year on driver
16/10/14 17:23:37 TRACE ListingFileCatalog: Listing s3a://mybucket/reddit_year/year=2007 on driver
This doesn't mean that we're scaning the files in each partition, but Spark will store the locations of the partitions for future queries on the table.
You can see the logs that it is actually passing in partition filters to prune the data:
16/10/14 17:23:48 TRACE ListingFileCatalog: Partition spec: PartitionSpec(StructType(StructField(year,IntegerType,true)),ArrayBuffer(PartitionDirectory([2012],s3a://mybucket/reddit_year/year=2012), PartitionDirectory([2010],s3a://mybucket/reddit_year/year=2010), ...PartitionDirectory([2015],s3a://mybucket/reddit_year/year=2015), PartitionDirectory([2011],s3a://mybucket/reddit_year/year=2011)))
16/10/14 17:23:48 INFO ListingFileCatalog: Selected 1 partitions out of 9, pruned 88.88888888888889% partitions.
You can see this in the logical plan if you run an explain(True)
on your query:
spark.sql("select created_utc, score, name from reddit where year = '2014'").explain(True)
This will show you the plan and you can see that it is filtering at the bottom of the plan:
+- *BatchedScan parquet [created_utc#58,name#65,score#69L,year#74] Format: ParquetFormat, InputPaths: s3a://mybucket/reddit_year, PartitionFilters: [isnotnull(year#74), (cast(year#74 as double) = 2014.0)], PushedFilters: [], ReadSchema: struct<created_utc:string,name:string,score:bigint>