How to flatten tuple created using zip transformation in PySpark

Question

I have two RDDs, RDD1 and RDD2, with the following structure:

RDD1:

[(u'abc', 1.0), (u'cde', 1.0),....]

RDD2:

[3.0, 0.0,....]

Now I want to form a third RDD that combines the values at each index of the two RDDs above. So the output should become:

RDD3:

[(u'abc', 1.0,3.0), (u'cde', 1.0,0.0),....]

As you can see, the values from RDD2 are appended to the tuples of RDD1. How can I do that? I tried RDD3 = RDD1.map(lambda x:x).zip(RDD2), but it produces [((u'abc', 1.0),3.0), ((u'cde', 1.0),0.0),....], which is not what I want: each RDD1 tuple stays nested inside its own parentheses, separate from the RDD2 value.

NOTE: RDD1 was created with RDD1 = data.map(lambda x:(x[0])).zip(val)

@Marcin I did `RDD3 = RDD1.map(lambda x:x).zip(RDD2)` like I have mentioned in my post above but that does not produce the desired output — user2966197, Aug 17 '15 at 19:32
*subsequent*; also your map is a no-op as it applies the identity transformation. — Marcin, Aug 17 '15 at 21:01

score 4 · Accepted Answer · edited May 23 '17 at 10:30

You can simply reshape your data after zipping:

rdd1 = sc.parallelize([(u'abc', 1.0), (u'cde', 1.0)])
rdd2 = sc.parallelize([3.0, 0.0])

rdd1.zip(rdd2).map(lambda t: (t[0][0], t[0][1], t[1]))

In Python 2 it is possible to use:

rdd1.zip(rdd2).map(lambda ((x1, x2), y): (x1, x2, y))

but tuple parameter unpacking was removed in Python 3 (PEP 3113), so this lambda is a syntax error there.
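One way to keep the readability of the unpacking version in Python 3 is to move the unpacking into the body of a named function. The sketch below is only an illustration: the name flatten_pair is hypothetical, and plain Python lists with the built-in zip stand in for rdd1.zip(rdd2) so that it runs without a Spark context.

```python
def flatten_pair(pair):
    """Turn ((x1, x2), y) into (x1, x2, y)."""
    # Unpacking inside the body is still valid in Python 3; only unpacking
    # in the parameter list was removed.
    (x1, x2), y = pair
    return (x1, x2, y)


# Assumption: plain lists stand in for the two RDDs from the answer.
left = [(u'abc', 1.0), (u'cde', 1.0)]
right = [3.0, 0.0]

flattened = [flatten_pair(p) for p in zip(left, right)]
print(flattened)
```

With real RDDs the same function should be usable as rdd1.zip(rdd2).map(flatten_pair).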

If you have more values to extract, using indices may become tedious:

lambda t: (t[0][0], t[0][1], t[0][2], ..., t[1])

so you can experiment with something like this:

lambda t: tuple(list(t[0]) + [t[1]])
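As a possible alternative to building an intermediate list, Python 3.5+ allows star-unpacking inside a tuple display. The following is a minimal sketch, not part of the original answer: the helper name append_value is hypothetical, and the built-in zip over plain lists stands in for rdd1.zip(rdd2).

```python
def append_value(pair):
    """Append the second element of the pair to the tuple in the first."""
    head, last = pair
    # (*head, last) copies every element of head, whatever its length,
    # and then adds the trailing value.
    return (*head, last)


# Assumption: plain lists stand in for the RDDs; the tuples here are wider
# than in the question to show that the length does not matter.
left = [(u'abc', 1.0, 2.0), (u'cde', 1.0, 5.0)]
right = [3.0, 0.0]

widened = [append_value(p) for p in zip(left, right)]
print(widened)
```

It should give the same result as tuple(list(t[0]) + [t[1]]) and could be passed to map in the same way.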

or implement a more sophisticated solution, such as those in the question "Flatten (an irregular) list of lists".
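For arbitrarily nested records, a small recursive generator is one option. This is a sketch under assumptions, not the code from the linked question: the name flatten is illustrative, only tuples and lists are treated as containers (so strings stay intact), and a hand-written nested tuple stands in for one zipped element.

```python
def flatten(item):
    """Yield the leaf values of arbitrarily nested tuples and lists."""
    if isinstance(item, (tuple, list)):
        for element in item:
            # Recurse into nested containers and re-yield their leaves.
            yield from flatten(element)
    else:
        # Strings and numbers are leaves and are yielded unchanged.
        yield item


# Assumption: a hand-built record with irregular nesting.
nested = ((u'abc', (1.0, 2.0)), 3.0)
flat = tuple(flatten(nested))
print(flat)
```

On an RDD this would presumably be applied as rdd1.zip(rdd2).map(lambda t: tuple(flatten(t))).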
