
Could you please help me find a Java API for repartitioning a sales dataset into N partitions of equal size? By equal size I mean an equal number of rows.

```java
Dataset<Row> sales = sparkSession.read().parquet(salesPath);
sales.toJavaRDD().partitions().size(); // returns 1
```
Shreeharsha
VB_
    Possible duplicate of [How to Define Custom partitioner for Spark RDDs of equally sized partition where each partition has equal number of elements?](http://stackoverflow.com/questions/23127329/how-to-define-custom-partitioner-for-spark-rdds-of-equally-sized-partition-where) – Bradley Kaiser Feb 07 '17 at 19:13
  • @BradleyKaiser no, I am sure that answer is bad for two reasons: 1) the answer shows the code of a partitioner, but does not show how to pass a custom partitioner to the RDD API, which is actually the question; 2) the partitioner code is in Scala, and the Scala API may differ from the Java API – VB_ Feb 08 '17 at 12:27

1 Answer


AFAIK custom partitioners are not supported for Datasets. The whole idea of the Dataset and DataFrame APIs in Spark 2+ is to abstract away the need to meddle with custom partitioners. So if we face data skew and reach a point where a custom partitioner is the only option, I guess we would drop down to lower-level RDD manipulation.

For example: the Facebook use-case study and the Spark Summit talk related to it.

Defining partitioners for RDDs is well documented in the API docs.
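As a rough sketch of the RDD-level approach in Java: the idea would be to key each row with a running index (e.g. via `zipWithIndex()`), then pass a custom `org.apache.spark.Partitioner` to `JavaPairRDD.partitionBy` whose `getPartition` maps the index to one of N contiguous, equal-size chunks. The class below is a made-up, plain-Java stand-in for that index-to-partition math only (no Spark dependency); in real code it would live inside a `Partitioner` subclass:

```java
// Hypothetical helper: maps a row index (0..totalRows-1) to one of
// numPartitions partitions so that partition sizes differ by at most one row.
// In Spark, getPartition(key) of a custom Partitioner would delegate to
// partitionFor((Long) key) after the RDD has been keyed with zipWithIndex().
// Assumes totalRows >= numPartitions.
class EqualSizePartitioning {
    private final long totalRows;
    private final int numPartitions;

    EqualSizePartitioning(long totalRows, int numPartitions) {
        this.totalRows = totalRows;
        this.numPartitions = numPartitions;
    }

    int partitionFor(long rowIndex) {
        long base = totalRows / numPartitions;        // minimum rows per partition
        long remainder = totalRows % numPartitions;   // partitions that get one extra row
        long threshold = (base + 1) * remainder;      // rows covered by the larger partitions
        if (rowIndex < threshold) {
            return (int) (rowIndex / (base + 1));
        }
        return (int) (remainder + (rowIndex - threshold) / base);
    }
}
```

For example, 10 rows over 3 partitions yields sizes 4, 3, 3. Note that if exact equality of row counts is not required, `Dataset.repartition(n)` already distributes rows round-robin into roughly equal partitions without any custom code.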

TheGT