9

I have two RDDs, both the result of a groupBy, which look like:

[(u'1', [u'0']), (u'3', [u'1']), (u'2', [u'0']), (u'4', [u'1'])]

and

[(u'1', [u'3', u'4']), (u'0', [u'1', u'2'])]

How can I merge the two and get the following:

[(u'1', [u'0', u'3', u'4']), (u'3', [u'1']), (u'2', [u'0']), (u'4', [u'1']), (u'0', [u'1', u'2'])]

I tried the join command, but that did not give me the result I was looking for. Any help is much appreciated.

ahajib

2 Answers

16

I solved it using:

rdd2.union(rdd1).reduceByKey(lambda x,y : x+y)
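
To see what that one-liner computes, here is a minimal plain-Python sketch that imitates `union` followed by `reduceByKey` on local lists, with no Spark involved. The helper name `union_reduce_by_key` is hypothetical, not part of any Spark API. It puts `rdd1` first so that the values for key `u'1'` come out in the order shown in the question. Real Spark makes no promise about the order of keys or of concatenated values across partitions, so treat the ordering here as an artifact of the local version.

```python
from collections import OrderedDict

def union_reduce_by_key(left, right, combine):
    """Hypothetical local stand-in for left.union(right).reduceByKey(combine)."""
    merged = OrderedDict()
    # union: simply concatenate both datasets, duplicates of a key included
    for key, value in list(left) + list(right):
        if key in merged:
            # key seen before: fold the new value into the accumulated one
            merged[key] = combine(merged[key], value)
        else:
            # first occurrence of this key: keep the value as-is
            merged[key] = value
    return list(merged.items())

rdd1 = [(u'1', [u'0']), (u'3', [u'1']), (u'2', [u'0']), (u'4', [u'1'])]
rdd2 = [(u'1', [u'3', u'4']), (u'0', [u'1', u'2'])]

# List concatenation plays the role of lambda x, y: x + y in the answer
result = union_reduce_by_key(rdd1, rdd2, lambda x, y: x + y)
print(result)
```

Keys that appear in only one input pass through untouched, which is why `u'0'` survives here.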

None of the following worked for me in PySpark (both are written in Scala syntax):

(rdd1 union rdd2).reduceByKey(_ ++ _)

or

rdd1.join(rdd2).map(case (k, (ls, rs)) => (k, ls ++ rs))
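
Leaving aside that the snippets above are Scala rather than Python, a plain `join` is probably not what the question needs anyway: an inner join only emits keys present in both RDDs. The sketch below mimics that behaviour in plain Python. The helper `inner_join` is hypothetical and assumes each key occurs at most once per input, as it would after a groupBy.

```python
def inner_join(left, right):
    """Hypothetical local stand-in for an inner join of two (key, value) lists."""
    right_by_key = dict(right)
    # Only keys found on both sides produce an output pair
    return [(k, (lv, right_by_key[k])) for k, lv in left if k in right_by_key]

rdd1 = [(u'1', [u'0']), (u'3', [u'1']), (u'2', [u'0']), (u'4', [u'1'])]
rdd2 = [(u'1', [u'3', u'4']), (u'0', [u'1', u'2'])]

joined = inner_join(rdd1, rdd2)
# Concatenate the two value lists for every key that survived the join
merged = [(k, ls + rs) for k, (ls, rs) in joined]
print(merged)
```

Only `u'1'` is shared by both inputs, so every other key is dropped.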

Best of luck to everyone.

ahajib
0
data1 = [(u'1', [u'0']), (u'3', [u'1']), (u'2', [u'0']), (u'4', [u'1'])]
data2 = [(u'1', [u'3', u'4']), (u'0', [u'1', u'2'])]

distData1 = sc.parallelize(data1)
distData2 = sc.parallelize(data2)
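# fullOuterJoin keeps keys that appear in only one RDD (leftOuterJoin would drop u'0')
distData3 = distData1.fullOuterJoin(distData2)
# A side with no match comes back as None, so substitute an empty list before concatenating
distData4 = distData3.map(lambda rec: (rec[0], (rec[1][0] or []) + (rec[1][1] or [])))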
hellow
  • 1
    While this might answer the author's question, it lacks explanation and links to documentation. Raw code snippets are not very helpful without some explanatory text around them. You may also find [how to write a good answer](https://stackoverflow.com/help/how-to-answer) very helpful. Please edit your answer. – hellow Sep 28 '18 at 07:54