Use of reduceByKey instead of leftOuterJoin in Spark for faster processing

Question

I have the followings rdds that I want to join them using leftOuterJoin. I was wondering if reduceByKey would be more efficient/faster than leftOuterJoin.

rd0= sc.parallelize([ ('s1', 'o1' ),("s1", 'o2' ),('s2','o2'),("s3",'o3')])
rd1= sc.parallelize([ ('s1', 'oo1' ),("s10", 'oo10' ),('s2','oo2')])
reduceByKeyMethod
rd00 = rd0.map(lambda x:(x[0],([x[1]],[])))
rd11 = rd1.map(lambda x:(x[0],([],[x[1]])))
rd00.union(rd11).reduceByKey(lambda x,y:(x[0]+y[0],x[1]+y[1])).collect()
Out[22]:
[('s1', (['o1'], [])),
 ('s1', (['o2'], [])),
 ('s2', (['o2'], [])),
 ('s3', (['o3'], [])),
 ('s1', ([], ['oo1'])),
 ('s10', ([], ['oo10'])),
 ('s2', ([], ['oo2']))]

vs using leftOuterJoin directly rd0.leftOuterJoin(rd1)
Will using reduceByKey be faster for large rd0 and rd1 datasets?

You can try to [time each method and test](https://stackoverflow.com/a/50289936/5858851). — pault, Nov 14 '19 at 19:29

score 2 · Accepted Answer · answered Nov 14 '19 at 20:44

If we check execution plan for both approaches => There Should be No Difference

As shown using toDebugString

print(rd00.union(rd11).reduceByKey(lambda x,y:(x[0]+y[0],x[1]+y[1])).toDebugString())

Prints

(4) PythonRDD[15] at RDD at PythonRDD.scala:49 []
 |  MapPartitionsRDD[14] at mapPartitions at PythonRDD.scala:129 []
 |  ShuffledRDD[13] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(4) PairwiseRDD[12] at reduceByKey at <stdin>:1 []
    |  PythonRDD[11] at reduceByKey at <stdin>:1 []
    |  UnionRDD[10] at union at NativeMethodAccessorImpl.java:0 []
    |  PythonRDD[2] at RDD at PythonRDD.scala:49 []
    |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:184 []
    |  PythonRDD[3] at RDD at PythonRDD.scala:49 []
    |  ParallelCollectionRDD[1] at parallelize at PythonRDD.scala:184 []

And leftOuterJoin

print(rd00.leftOuterJoin(rd11).toDebugString())

Prints

(4) PythonRDD[23] at RDD at PythonRDD.scala:49 []
 |  MapPartitionsRDD[22] at mapPartitions at PythonRDD.scala:129 []
 |  ShuffledRDD[21] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(4) PairwiseRDD[20] at leftOuterJoin at <stdin>:1 []
    |  PythonRDD[19] at leftOuterJoin at <stdin>:1 []
    |  UnionRDD[18] at union at NativeMethodAccessorImpl.java:0 []
    |  PythonRDD[16] at RDD at PythonRDD.scala:49 []
    |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:184 []
    |  PythonRDD[17] at RDD at PythonRDD.scala:49 []
    |  ParallelCollectionRDD[1] at parallelize at PythonRDD.scala:184 []

I couldn't understand , both EP using different methods . Same is written in line 4 from top. — PIG, Nov 15 '19 at 06:57

Use of reduceByKey instead of leftOuterJoin in Spark for faster processing

1 Answers1