I have a Hive external table that I created through Spark's DataFrameWriter API. It is about 700 GB, has about 5000 partitions, and each partition is bucketed into 50 buckets. I am using Spark 2.2.1. These are the relevant fields from the table's metadata:
Sort Columns      [`unique_id`]
Bucket Columns    [`unique_id`]
Num Buckets       50
Provider          parquet
Type              EXTERNAL
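For context, here is a minimal sketch of the kind of DataFrameWriter call that produces a table like this; the DataFrame `df`, the partition column `dt`, and the external path are illustrative placeholders, not the real names:

// Partitioned, bucketed (50 buckets on `unique_id`), sorted, external Parquet table.
// `df`, the partition column `dt`, and the path below are placeholders.
df.write
  .format("parquet")
  .partitionBy("dt")
  .bucketBy(50, "unique_id")
  .sortBy("unique_id")
  .option("path", "hdfs:///data/external/table1")
  .saveAsTable("table1")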
These are the other relevant parameters I have set:
spark.sql("SET spark.default.parallelism=1000")
spark.sql("set spark.sql.shuffle.partitions=500")
spark.sql("set spark.sql.files.maxPartitionBytes=134217728")
I have not changed the HDFS block size defaults.
$ hdfs getconf -confKey mapreduce.input.fileinputformat.split.minsize
0
$ hdfs getconf -confKey dfs.blocksize
134217728
$ hdfs getconf -confKey mapreduce.job.maps
32
Now when I read this table, its RDD partition count is only 50.
scala> spark.table("table1").rdd.partitions.size
res25: Int = 50
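The underlying data is clearly spread across far more files than that (roughly partitions x buckets of them). As a rough sanity check, a sketch of listing the file count from the DataFrame itself:

// Best-effort list of the files backing the table; expected to be far more than 50.
spark.table("table1").inputFiles.length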
How do I increase the parallelism here? Why isn't Spark able to figure out the (horizontal) partition paths from the metastore and pick an appropriate partition count for the RDD, the way it does with plain Parquet?
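To illustrate what I'm trying to avoid: forcing the parallelism with an explicit repartition does work, but only by introducing a full shuffle of the ~700 GB:

// Forces 1000 partitions, at the cost of shuffling the whole table:
spark.table("table1").repartition(1000).rdd.partitions.size  // Int = 1000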