I have the following RDDs that I want to join using leftOuterJoin. I was wondering if reduceByKey would be more efficient/faster than leftOuterJoin.
rd0 = sc.parallelize([('s1', 'o1'), ('s1', 'o2'), ('s2', 'o2'), ('s3', 'o3')])
rd1 = sc.parallelize([('s1', 'oo1'), ('s10', 'oo10'), ('s2', 'oo2')])
reduceByKey method:
rd00 = rd0.map(lambda x:(x[0],([x[1]],[])))
rd11 = rd1.map(lambda x:(x[0],([],[x[1]])))
rd00.union(rd11).reduceByKey(lambda x,y:(x[0]+y[0],x[1]+y[1])).collect()
Out[22]:
[('s1', (['o1', 'o2'], ['oo1'])),
 ('s2', (['o2'], ['oo2'])),
 ('s3', (['o3'], [])),
 ('s10', ([], ['oo10']))]
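For reference, the merge logic of the union + reduceByKey pattern can be sketched in plain Python (no Spark required) to check its semantics. Note that, unlike leftOuterJoin, it also keeps keys that appear only in rd1 (e.g. 's10'), so it actually behaves more like a cogroup/full outer join unless such keys are filtered out afterwards:

```python
from collections import defaultdict

rd0 = [('s1', 'o1'), ('s1', 'o2'), ('s2', 'o2'), ('s3', 'o3')]
rd1 = [('s1', 'oo1'), ('s10', 'oo10'), ('s2', 'oo2')]

# Tag each side, mirroring the two map steps above.
tagged = [(k, ([v], [])) for k, v in rd0] + [(k, ([], [v])) for k, v in rd1]

# Emulate reduceByKey: concatenate the per-key pairs of lists.
merged = defaultdict(lambda: ([], []))
for k, (left, right) in tagged:
    merged[k] = (merged[k][0] + left, merged[k][1] + right)

# To approximate leftOuterJoin semantics, drop keys with no left-side values.
left_join_like = {k: v for k, v in merged.items() if v[0]}
```

This is only a single-machine sketch of the merge function's behavior, not of Spark's distributed execution.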
vs. using leftOuterJoin directly: rd0.leftOuterJoin(rd1).
Will using reduceByKey be faster for large rd0 and rd1 datasets?