
Could you please help me find a Java API for repartitioning a sales dataset into N partitions of equal size? By equal size I mean an equal number of rows.

```java
Dataset<Row> sales = sparkSession.read().parquet(salesPath);
sales.toJavaRDD().partitions().size(); // returns 1
```
Shreeharsha
VB_
    Possible duplicate of [How to Define Custom partitioner for Spark RDDs of equally sized partition where each partition has equal number of elements?](http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where) – Bradley Kaiser Feb 07 '17 at 19:13
  • @BradleyKaiser no, I am sure that answer is bad for two reasons: 1) the answer shows the code of a partitioner, but does not show how to pass a custom partitioner to the RDD API, which is actually the question; 2) the partitioner code is in Scala, and the Scala API may differ from the Java API – VB_ Feb 08 '17 at 12:27

1 Answer


AFAIK custom partitioners are not supported for Datasets. The whole idea of the Dataset and DataFrame APIs in Spark 2+ is to abstract away the need to meddle with custom partitioners. So if we face data skew and reach a point where a custom partitioner is the only option, I guess we would drop down to lower-level RDD manipulation.

For example: the Facebook use-case study and the Spark Summit talk related to it.

Defining partitioners for RDDs is well documented in the API docs.
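As a rough sketch of the RDD-level approach in Java: the idea would be to key each row with a running index (e.g. via `zipWithIndex()`), then pass a custom `org.apache.spark.Partitioner` to `JavaPairRDD.partitionBy` whose `getPartition` maps the index to one of N contiguous, equal-size chunks. The class below is a made-up, plain-Java stand-in for that index-to-partition math only (no Spark dependency); in real code it would live inside a `Partitioner` subclass:

```java
// Hypothetical helper: maps a row index (0..totalRows-1) to one of
// numPartitions partitions so that partition sizes differ by at most one row.
// In Spark, getPartition(key) of a custom Partitioner would delegate to
// partitionFor((Long) key) after the RDD has been keyed with zipWithIndex().
// Assumes totalRows >= numPartitions.
class EqualSizePartitioning {
    private final long totalRows;
    private final int numPartitions;

    EqualSizePartitioning(long totalRows, int numPartitions) {
        this.totalRows = totalRows;
        this.numPartitions = numPartitions;
    }

    int partitionFor(long rowIndex) {
        long base = totalRows / numPartitions;        // minimum rows per partition
        long remainder = totalRows % numPartitions;   // partitions that get one extra row
        long threshold = (base + 1) * remainder;      // rows covered by the larger partitions
        if (rowIndex < threshold) {
            return (int) (rowIndex / (base + 1));
        }
        return (int) (remainder + (rowIndex - threshold) / base);
    }
}
```

For example, 10 rows over 3 partitions yields sizes 4, 3, 3. Note that if exact equality of row counts is not required, `Dataset.repartition(n)` already distributes rows round-robin into roughly equal partitions without any custom code.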

TheGT