I am working with a large dataset that is partitioned by two columns: plant_name and tag_id. The second partition column, tag_id, has 200,000 unique values, and I mostly access the data by specific tag_id values. If I use the following Spark commands:
sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
val df = sqlContext.sql("select * from tag_data where plant_name='PLANT01' and tag_id='1000'")
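If it matters, the same filter can also be expressed through the DataFrame API; a minimal sketch of the equivalent, assuming the same tag_data table and that sqlContext.implicits._ is in scope for the $ column syntax:

// Sketch: the same single-partition filter via the DataFrame API,
// using the same Hive-backed sqlContext as above.
import sqlContext.implicits._

val dfApi = sqlContext.table("tag_data")
  .where($"plant_name" === "PLANT01" && $"tag_id" === "1000")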
I would expect a fast response, since the query resolves to a single partition. In Hive and Presto this takes seconds; in Spark, however, it runs for hours.
The actual data is held in an S3 bucket, and when I submit the SQL query, Spark first fetches all of the partitions from the Hive metastore (200,000 of them) and then calls refresh() to force a full file-status listing of all these files in the S3 object store (actually calling listLeafFilesInParallel).
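For what it's worth, the size of the partition list itself is easy to confirm; a minimal sketch, assuming the same Hive-backed sqlContext:

// Count the partition entries the metastore returns for this table;
// with ~200,000 tag_id values this list alone is already very large.
val partitions = sqlContext.sql("show partitions tag_data")
println(partitions.count())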
These two operations are what make the query so expensive. Are there any settings that can get Spark to prune the partitions earlier, either during the call to the metastore or immediately afterwards?