When decreasing the number of partitions one can use `coalesce`, which is great because it doesn't cause a shuffle and seems to take effect instantly (it doesn't require an additional job stage).
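
For reference, a minimal sketch of the decreasing case that works nicely for me (assuming `sc` is an existing `SparkContext`; the numbers are just for illustration):

```scala
// Decreasing case: coalesce shrinks the partition count through a narrow
// dependency, so there is no shuffle and no extra job stage.
val rdd = sc.parallelize(1 to 1000000, 100)  // 100 partitions
val fewer = rdd.coalesce(10)                 // 10 partitions, no shuffle
println(fewer.partitions.length)             // 10
```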
I would like to do the opposite sometimes, but `repartition` induces a shuffle. I think a few months ago I actually got this working by using `CoalescedRDD` with `balanceSlack = 1.0`, so what would happen is it would split a partition so that the resulting partitions' locations were all on the same node (so very little network IO).
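
To make the increasing case concrete (continuing the toy example above; as far as I can tell `repartition(n)` is just `coalesce(n, shuffle = true)` under the hood, and `coalesce` without a shuffle can't grow the partition count at all):

```scala
// Going the other way always shuffles: repartition(n) is implemented as
// coalesce(n, shuffle = true), and coalesce without shuffle = true
// silently refuses to increase the partition count.
val more     = fewer.repartition(100)               // full shuffle, new stage
val sameIdea = fewer.coalesce(100, shuffle = true)  // equivalent
val noOp     = fewer.coalesce(100)                  // still 10 partitions
```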
This kind of functionality is automatic in Hadoop; one just tweaks the split size. It doesn't seem to work this way in Spark unless one is decreasing the number of partitions. I think the solution might be to write a custom partitioner along with a custom RDD where we define `getPreferredLocations` ... but I thought this is such a simple and common thing to do that surely there must be a straightforward way of doing it?
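
Something like the sketch below is what I have in mind. It's a rough, untested outline rather than working code: `SplitEachPartitionRDD` and `SplitPartition` are names I've made up, each parent partition gets naively scanned twice, and the only interesting part is `getPreferredLocations` delegating to the parent partition's locations.

```scala
import scala.reflect.ClassTag

import org.apache.spark.{NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Made-up child partition type: each parent partition is split into two halves.
case class SplitPartition(idx: Int, parent: Partition, half: Int) extends Partition {
  override def index: Int = idx
}

// Rough sketch of the custom-RDD idea: double the partition count while
// reporting each child's preferred locations as its parent's, so the new
// partitions stay node-local and no shuffle is needed.
class SplitEachPartitionRDD[T: ClassTag](prev: RDD[T])
  extends RDD[T](prev.context, Seq(new NarrowDependency[T](prev) {
    // child partition i reads parent partition i / 2
    override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
  })) {

  override def getPartitions: Array[Partition] =
    prev.partitions.flatMap { p =>
      Seq[Partition](SplitPartition(2 * p.index, p, 0),
                     SplitPartition(2 * p.index + 1, p, 1))
    }

  // The bit I actually care about: keep each half local to the parent's node(s).
  override def getPreferredLocations(split: Partition): Seq[String] =
    prev.preferredLocations(split.asInstanceOf[SplitPartition].parent)

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val s = split.asInstanceOf[SplitPartition]
    // Naive: re-scan the parent partition and keep alternating elements,
    // so each parent partition is read twice (once per half).
    prev.iterator(s.parent, context).zipWithIndex.collect {
      case (elem, i) if i % 2 == s.half => elem
    }
  }
}
```

Then `new SplitEachPartitionRDD(rdd)` would give `2 * rdd.partitions.length` partitions without a shuffle ... but surely I shouldn't have to hand-roll something like this?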
Things tried:

- `.set("spark.default.parallelism", partitions)` on my `SparkConf`
- when in the context of reading parquet, `sqlContext.sql("set spark.sql.shuffle.partitions= ...")`, which on 1.0.0 causes an error AND isn't really what I want anyway

I want the partition number to change across all types of job, not just shuffles.
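
Putting those two attempts together for clarity (`partitions` is just an `Int` of my choosing, `sqlContext` an existing `SQLContext`, and the app name is made up):

```scala
import org.apache.spark.SparkConf

// Attempt 1: as far as I can tell this only acts as a default for shuffles
// and for operations that take a numPartitions argument, not for how the
// data is split when it's read in the first place.
val conf = new SparkConf()
  .setAppName("more-partitions")
  .set("spark.default.parallelism", partitions.toString)

// Attempt 2 (Spark SQL / parquet): errors on 1.0.0 for me, and in any case
// only controls the number of partitions used for shuffles.
sqlContext.sql(s"set spark.sql.shuffle.partitions=$partitions")
```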