I have a Hive external table that I created through Spark's DataFrameWriter interface. It is about 700GB, has about 5000 partitions, and each partition is bucketed into 50 buckets. I am using Spark 2.2.1.

Sort Columns: [`unique_id`]
Bucket Columns: [`unique_id`]
Num Buckets: 50
Provider: parquet
Type: EXTERNAL
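
For reference, the write was roughly of this shape (a sketch: `part_col`, the source path, and the table location below are placeholders, not the real names):

// Sketch of the write; `part_col` and the paths are placeholders.
val df = spark.read.parquet("hdfs:///path/to/source")   // source data (placeholder path)
df.write
  .partitionBy("part_col")                      // ~5000 horizontal partitions
  .bucketBy(50, "unique_id")                    // 50 buckets per partition
  .sortBy("unique_id")
  .format("parquet")
  .option("path", "hdfs:///path/to/table1")     // external table location
  .saveAsTable("table1")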

These are the other relevant parameters I have set:

spark.sql("SET spark.default.parallelism=1000")
spark.sql("set spark.sql.shuffle.partitions=500")
spark.sql("set spark.sql.files.maxPartitionBytes=134217728")

I have not changed the HDFS block size defaults.

$ hdfs getconf -confKey mapreduce.input.fileinputformat.split.minsize
0
$ hdfs getconf -confKey dfs.blocksize
134217728
$ hdfs getconf -confKey mapreduce.job.maps
32

Now when I read this table, the resulting RDD has only 50 partitions.

scala> spark.table("table1").rdd.partitions.size
res25: Int = 50

How do I increase the parallelism here? Why isn't Spark able to use the (horizontal) partition paths from the metastore to derive an appropriate partition count for the RDD, like it does with plain Parquet?
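
To be clear about what I'm trying to avoid: I could always force more partitions with an explicit repartition after the scan, but that pays for the parallelism with a full shuffle (sketch; 500 is an arbitrary target):

// Workaround I'd like to avoid: repartition raises the parallelism,
// but only by shuffling the entire ~700GB table first.
val df = spark.table("table1").repartition(500)   // 500 is an arbitrary example
df.rdd.partitions.size                            // 500, but only after a shuffle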

  • Possible duplicate of [How to change partition size in Spark SQL](https://stackoverflow.com/questions/38249624/how-to-change-partition-size-in-spark-sql) – Alper t. Turker Feb 14 '18 at 14:28
  • This question is specific to a hive EXTERNAL table and Spark 2.2+. Have updated the question with the answer suggested in that question too. – pnpranavrao Feb 14 '18 at 15:21
  • Also added the HDFS options suggested in this similar question - https://stackoverflow.com/questions/44061443/how-does-spark-sql-decide-the-number-of-partitions-it-will-use-when-loading-data. They don't work. – pnpranavrao Feb 14 '18 at 15:21

0 Answers