
According to the docs of Spark 1.6.3, repartition(partitionExprs: Column*) should preserve the number of partitions in the resulting dataframe:

Returns a new DataFrame partitioned by the given partitioning expressions preserving the existing number of partitions

(taken from https://spark.apache.org/docs/1.6.3/api/scala/index.html#org.apache.spark.sql.DataFrame)

But the following example seems to show something else (note that spark-master is local[4] in my case):

val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[4]"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

val myDF = sc.parallelize(Seq(1,1,2,2,3,3)).toDF("x")
myDF.rdd.getNumPartitions // 4 
myDF.repartition($"x").rdd.getNumPartitions //  200 !

How can that be explained? I'm using Spark 1.6.3 as a standalone application (i.e. running locally in IntelliJ IDEA)

Edit: This question does not address the issue from Dropping empty DataFrame partitions in Apache Spark (i.e. how to repartition along a column without producing empty partitions), but why the docs say something different from what I observe in my example.

Raphael Roth

1 Answer


This is related to the Tungsten project, which is enabled in Spark. `repartition(partitionExprs: Column*)` uses hash partitioning, which triggers a shuffle operation, and by default `spark.sql.shuffle.partitions` is set to 200. You can verify this by calling `explain` on your DataFrame before and after repartitioning:

myDF.explain

val repartitionedDF = myDF.repartition($"x")

repartitionedDF.explain
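
If you want the hash-repartitioned DataFrame to end up with a specific number of partitions rather than the 200 default, you can either lower the shuffle setting or pass the count explicitly. A minimal sketch, reusing the `sqlContext` and `myDF` from the question (the resulting counts assume no other shuffle configuration overrides):

```scala
// Option 1: lower the global shuffle default
// (affects every subsequent shuffle on this SQLContext)
sqlContext.setConf("spark.sql.shuffle.partitions", "4")
myDF.repartition($"x").rdd.getNumPartitions // 4 instead of 200

// Option 2: pass the target partition count explicitly,
// using the repartition(numPartitions, partitionExprs*) overload added in Spark 1.6
myDF.repartition(4, $"x").rdd.getNumPartitions // 4
```

Option 2 is the more surgical choice, since it only affects this one repartition instead of changing shuffle behavior globally.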
FaigB
  • In a shuffle that uses hash partitioning, the number of partitions also changes according to the number of mapper and reducer tasks. – FaigB Jan 26 '17 at 13:57