Using Spark 2.4.0. My production data is extremely skewed, so one of the tasks was taking 7x longer than everything else. I tried different strategies to even out the partitions so that all executors do roughly equal work:
- spark.default.parallelism
- reduceByKey(numPartitions)
- repartition(numPartitions)
My expectation was that all three of them would partition evenly; however, playing with some dummy non-production data on Spark local/standalone suggests that the first two balance the data better than repartition.
The data is as below (I am trying to do a simple reduce on the balance per account + ccy combination):
account}date}ccy}amount
A1}2020/01/20}USD}100.12
A2}2010/01/20}SGD}200.24
A2}2010/01/20}USD}300.36
A1}2020/01/20}USD}400.12
Expected result should be [A1-USD,500.24], [A2-SGD,200.24], [A2-USD,300.36]
Ideally these should be partitioned in 3 different partitions.
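For context, a minimal sketch of how such a `}`-delimited file can be read into `javaRDDWithoutHeader` (the `Balance` POJO, getters, and file path here are illustrative, not the production code):

```java
import java.io.Serializable;
import java.math.BigDecimal;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Assumed POJO; the real Balance class may have more fields.
class Balance implements Serializable {
    private final String account, date, ccy;
    private final BigDecimal amount;

    Balance(String account, String date, String ccy, BigDecimal amount) {
        this.account = account; this.date = date; this.ccy = ccy; this.amount = amount;
    }
    public String getAccount()    { return account; }
    public String getCcy()        { return ccy; }
    public BigDecimal getAmount() { return amount; }
}

// In the driver (sc is an assumed JavaSparkContext, "balances.txt" is a placeholder path):
JavaRDD<String> lines = sc.textFile("balances.txt");
String header = lines.first();                       // "account}date}ccy}amount"
JavaRDD<Balance> javaRDDWithoutHeader = lines
        .filter(line -> !line.equals(header))        // drop the header row
        .map(line -> {
            String[] f = line.split("\\}");          // '}' is the field delimiter
            return new Balance(f[0], f[1], f[2], new BigDecimal(f[3]));
        });
```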
The reduce itself:

```java
javaRDDWithoutHeader
        .mapToPair(new MyPairFunction())        // Balance -> Tuple2<DummyString(account, ccy), amount>
        .reduceByKey(new ReductionFunction());  // sum the amounts per (account, ccy) key
```
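For reference, a minimal sketch of what `MyPairFunction`, `ReductionFunction`, and the `DummyString` key amount to (simplified; the real classes may differ in detail):

```java
import java.io.Serializable;
import java.math.BigDecimal;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// Key type; must implement equals()/hashCode() over (account, ccy) so that
// reduceByKey (and any HashPartitioner) groups the records correctly.
class DummyString implements Serializable {
    final String account, ccy;
    DummyString(String account, String ccy) { this.account = account; this.ccy = ccy; }
    @Override public boolean equals(Object o) {
        if (!(o instanceof DummyString)) return false;
        DummyString d = (DummyString) o;
        return account.equals(d.account) && ccy.equals(d.ccy);
    }
    @Override public int hashCode() { return java.util.Objects.hash(account, ccy); }
    @Override public String toString() {
        return "DummyString{account='" + account + "', ccy='" + ccy + "'}";
    }
}

// Key each Balance by (account, ccy) and use the amount as the value.
class MyPairFunction implements PairFunction<Balance, DummyString, BigDecimal> {
    @Override public Tuple2<DummyString, BigDecimal> call(Balance balance) {
        return new Tuple2<>(new DummyString(balance.getAccount(), balance.getCcy()),
                            balance.getAmount());
    }
}

// Sum the two amounts for the same (account, ccy) key.
class ReductionFunction implements Function2<BigDecimal, BigDecimal, BigDecimal> {
    @Override public BigDecimal call(BigDecimal a, BigDecimal b) {
        return a.add(b);
    }
}
```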
Code to check the partitions (`pairRDD` is the reduced `JavaPairRDD<DummyString, BigDecimal>` from above; the `repartition(3)` line is only in play for option 4):

```java
System.out.println("b4 = " + pairRDD.getNumPartitions());
System.out.println(pairRDD.glom().collect());            // contents of each partition

JavaPairRDD<DummyString, BigDecimal> newPairRDD = pairRDD.repartition(3);
System.out.println("Number of partitions = " + newPairRDD.getNumPartitions());
System.out.println(newPairRDD.glom().collect());
```
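As an aside, a sketch of a more driver-friendly way to check partition sizes than `glom().collect()` (fine for this toy data set, not for production volumes):

```java
import java.util.Collections;
import java.util.List;

// Count records per partition on the executors; only the counts reach the driver.
List<String> sizes = newPairRDD
        .mapPartitionsWithIndex((idx, it) -> {
            long n = 0;
            while (it.hasNext()) { it.next(); n++; }
            return Collections.singletonList("partition " + idx + ": " + n + " record(s)").iterator();
        }, true)
        .collect();
System.out.println(sizes);
```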
The four options compared (how each one is wired up is sketched right after this list):

- Option 1: Doing nothing
- Option 2: Setting spark.default.parallelism to 3
- Option 3: reduceByKey with numPartitions = 3
- Option 4: repartition(3)
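Roughly how each option is wired up (a sketch; `pairs` stands for the un-reduced output of `mapToPair(new MyPairFunction())` above, and the local `SparkConf` settings are illustrative):

```java
import java.math.BigDecimal;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Option 2: fix the default shuffle parallelism for the whole application.
// With this set, reduceByKey() without an explicit numPartitions uses 3 partitions.
SparkConf conf = new SparkConf()
        .setAppName("skew-test")                  // placeholder app name
        .setMaster("local[*]")                    // local run; exact master/core count may differ
        .set("spark.default.parallelism", "3");
JavaSparkContext sc = new JavaSparkContext(conf);

// Option 3: ask reduceByKey for 3 partitions explicitly (hash-partitioned by key).
JavaPairRDD<DummyString, BigDecimal> option3 =
        pairs.reduceByKey(new ReductionFunction(), 3);

// Option 4: reduce with the defaults, then repartition the result.
JavaPairRDD<DummyString, BigDecimal> option4 =
        pairs.reduceByKey(new ReductionFunction()).repartition(3);
```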
For option 1, number of partitions = 2:

```
[
  [(DummyString{account='A2', ccy='SGD'},200.24), (DummyString{account='A2', ccy='USD'},300.36)],
  [(DummyString{account='A1', ccy='USD'},500.24)]
]
```

For option 2, number of partitions = 3:

```
[
  [(DummyString{account='A1', ccy='USD'},500.24)],
  [(DummyString{account='A2', ccy='USD'},300.36)],
  [(DummyString{account='A2', ccy='SGD'},200.24)]
]
```

For option 3, number of partitions = 3:

```
[
  [(DummyString{account='A1', ccy='USD'},500.24)],
  [(DummyString{account='A2', ccy='USD'},300.36)],
  [(DummyString{account='A2', ccy='SGD'},200.24)]
]
```

For option 4, number of partitions = 3 (one partition is empty):

```
[
  [],
  [(DummyString{account='A2', ccy='SGD'},200.24)],
  [(DummyString{account='A2', ccy='USD'},300.36), (DummyString{account='A1', ccy='USD'},500.24)]
]
```
Conclusion: options 2 (spark.default.parallelism) and 3 (reduceByKey(numPartitions)) balance the data much better than option 4 (repartition). The results are fairly deterministic; I never saw option 4 spread the data across all 3 partitions.
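My working assumption for the difference: `reduceByKey(numPartitions)` hash-partitions by the `(account, ccy)` key (it uses a `HashPartitioner` internally), whereas `repartition(3)` redistributes the existing rows round-robin from a random starting partition without looking at the keys, so with only three records one target partition can easily stay empty. A sketch of forcing the same hash layout explicitly on the reduced RDD:

```java
import java.math.BigDecimal;
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

// Hash-partition the reduced RDD by its (account, ccy) key into 3 partitions.
// reduceByKey(new ReductionFunction(), 3) should give the same layout, since it
// also uses a HashPartitioner; whether each key lands in its own partition
// still depends on DummyString.hashCode().
JavaPairRDD<DummyString, BigDecimal> hashed = pairRDD.partitionBy(new HashPartitioner(3));
System.out.println(hashed.glom().collect());
```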
Questions:

- Is reduceByKey(numPartitions) genuinely better at balancing the data than repartition, or
- is this just because the sample data set is so small, or
- will this behavior be different when we submit via a YARN cluster?