
According to the docs of Spark 1.6.3, repartition(partitionExprs: Column*) should preserve the number of partitions in the resulting dataframe:

Returns a new DataFrame partitioned by the given partitioning expressions preserving the existing number of partitions

(taken from https://spark.apache.org/docs/1.6.3/api/scala/index.html#org.apache.spark.sql.DataFrame)

But the following example seems to show something else (note that spark-master is local[4] in my case):

val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[4]"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

val myDF = sc.parallelize(Seq(1,1,2,2,3,3)).toDF("x")
myDF.rdd.getNumPartitions // 4 
myDF.repartition($"x").rdd.getNumPartitions //  200 !

How can that be explained? I'm using Spark 1.6.3 as a standalone application (i.e. running locally in IntelliJ IDEA)

Edit: This question does not address the issue from Dropping empty DataFrame partitions in Apache Spark (i.e. how to repartition along a column without producing empty partitions), but why the docs say something different from what I observe in my example.

Raphael Roth

1 Answer


This is related to the Tungsten project, which is enabled in Spark. `repartition(partitionExprs: Column*)` uses hash partitioning, which triggers a shuffle operation, and by default `spark.sql.shuffle.partitions` is set to 200. You can verify this by calling `explain` on your DataFrame before and after repartitioning:

myDF.explain

val repartitionedDF = myDF.repartition($"x")

repartitionedDF.explain
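
If you want the hash-repartitioned DataFrame to end up with a specific number of partitions rather than the 200 default, you can either lower the shuffle setting or pass the count explicitly. A minimal sketch, reusing the `sqlContext` and `myDF` from the question (the resulting counts assume no other shuffle configuration overrides):

```scala
// Option 1: lower the global shuffle default
// (affects every subsequent shuffle on this SQLContext)
sqlContext.setConf("spark.sql.shuffle.partitions", "4")
myDF.repartition($"x").rdd.getNumPartitions // 4 instead of 200

// Option 2: pass the target partition count explicitly,
// using the repartition(numPartitions, partitionExprs*) overload added in Spark 1.6
myDF.repartition(4, $"x").rdd.getNumPartitions // 4
```

Option 2 is the more surgical choice, since it only affects this one repartition instead of changing shuffle behavior globally.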
FaigB
  • In a shuffle that uses hash partitioning, the number of partitions also changes according to the number of mapper and reducer tasks. – FaigB Jan 26 '17 at 13:57