According to the Spark 1.6.3 docs, repartition(partitionExprs: Column*)
should preserve the number of partitions in the resulting DataFrame:
"Returns a new DataFrame partitioned by the given partitioning expressions preserving the existing number of partitions."
(taken from https://spark.apache.org/docs/1.6.3/api/scala/index.html#org.apache.spark.sql.DataFrame)
But the following example seems to show something else (note that the Spark master is local[4] in my case):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[4]"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

val myDF = sc.parallelize(Seq(1, 1, 2, 2, 3, 3)).toDF("x")
myDF.rdd.getNumPartitions                   // 4
myDF.repartition($"x").rdd.getNumPartitions // 200 !
How can that be explained? I'm using Spark 1.6.3 as a standalone application (i.e. running locally from IntelliJ IDEA).
Edit: This question does not address the issue from Dropping empty DataFrame partitions in Apache Spark (i.e. how to repartition along a column without producing empty partitions), but asks why the docs say something different from what I observe in my example.