
I have two RDDs, RDD1 and RDD2, with the following structure:

RDD1:

[(u'abc', 1.0), (u'cde', 1.0),....]

RDD2:

[3.0, 0.0,....]

Now I want to form a third RDD that combines the values at each index of the two RDDs above. So the output should become:

RDD3:

[(u'abc', 1.0,3.0), (u'cde', 1.0,0.0),....]

As you can see, the values from RDD2 are appended to the tuples of RDD1. How can I do that? I tried RDD3 = RDD1.map(lambda x:x).zip(RDD2), but it produces [((u'abc', 1.0),3.0), ((u'cde', 1.0),0.0),....], which is not what I want: the values from RDD1 and RDD2 are kept apart by an extra pair of parentheses.

NOTE: RDD1 was formed using RDD1 = data.map(lambda x:(x[0])).zip(val)

zero323
user2966197

1 Answer


You can simply reshape your data after zipping:

rdd1 = sc.parallelize([(u'abc', 1.0), (u'cde', 1.0)])
rdd2 = sc.parallelize([3.0, 0.0])

rdd1.zip(rdd2).map(lambda t: (t[0][0], t[0][1], t[1]))

In Python 2 it is possible to use:

rdd1.zip(rdd2).map(lambda ((x1, x2), y): (x1, x2, y))

but tuple parameter unpacking was removed in Python 3 (PEP 3113), so this lambda is a syntax error there.
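In Python 3 the same unpacking can be done inside the body of a named function instead. A minimal sketch: flatten_pair is a hypothetical helper name, and plain lists stand in for the RDDs so the snippet runs without Spark. With Spark it would be passed as rdd1.zip(rdd2).map(flatten_pair).

```python
def flatten_pair(pair):
    # pair has the shape ((x1, x2), y), as produced by zipping the two RDDs
    (x1, x2), y = pair
    return (x1, x2, y)

# Plain-Python stand-in for rdd1.zip(rdd2); no Spark needed for this sketch
zipped = list(zip([(u'abc', 1.0), (u'cde', 1.0)], [3.0, 0.0]))
result = [flatten_pair(p) for p in zipped]
print(result)
```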

If you have more values to extract, using indices may be tedious:

lambda t: (t[0][0], t[0][1], t[0][2], ..., t[1])

so you can experiment with something like this:

lambda t: tuple(list(t[0]) + [t[1]])
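The same idea can be written with tuple concatenation, which skips the intermediate list. This is a sketch under the assumption that the first element of each zipped pair is a tuple (or other iterable) of any length; append_value is a hypothetical name, and it would be used as rdd1.zip(rdd2).map(append_value).

```python
def append_value(pair):
    # pair is (values, extra): values is a tuple of any length, extra is one value
    values, extra = pair
    # (extra,) is a one-element tuple, so + concatenates rather than nests
    return tuple(values) + (extra,)

# Works regardless of how many fields the left-hand tuple carries
short = append_value(((u'abc', 1.0), 3.0))
longer = append_value(((u'abc', 1.0, 2.0, 4.0), 3.0))
print(short, longer)
```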

or implement a more sophisticated solution, such as the ones in: Flatten (an irregular) list of lists

Community
zero323