
I have an RDD of (key, [word1, word2, word3]) records and I want to transform it into (key, word1), (key, word2), ..., (key, word-n). Can anyone point me in the right direction on how to solve this?

fomox
  • Possible duplicate of [What is the difference between map and flatMap and a good use case for each?](https://stackoverflow.com/questions/22350722/what-is-the-difference-between-map-and-flatmap-and-a-good-use-case-for-each) – David Apr 23 '18 at 13:01
  • What have you tried so far? – abukaj Apr 23 '18 at 13:06

2 Answers


Use a list comprehension:

key, list_ = ('key', ['word1', 'word2', 'word3'])
result = [(key, item) for item in list_]
print(result)

Output:

[('key', 'word1'), ('key', 'word2'), ('key', 'word3')]

You can apply this solution to your RDD using flatMap():

myrdd = sc.parallelize([('key', ['word1', 'word2', 'word3'])])
myrdd.flatMap(lambda row: [(row[0], item) for item in row[1]]).collect()
#[('key', 'word1'), ('key', 'word2'), ('key', 'word3')]
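
To see why flatMap() rather than map() is the right call here, the two can be mimicked on a plain Python list without Spark. This is only a local sketch: `local_map`, `local_flat_map`, and `expand` are hypothetical helper names, not part of PySpark.

```python
from itertools import chain


def local_map(func, rows):
    """Apply func to every row: exactly one output element per input row."""
    return [func(row) for row in rows]


def local_flat_map(func, rows):
    """Apply func to every row, then flatten the per-row results by one level."""
    return list(chain.from_iterable(func(row) for row in rows))


def expand(row):
    """Pair the key (row[0]) with each word in the list (row[1])."""
    return [(row[0], item) for item in row[1]]


rows = [('key', ['word1', 'word2', 'word3'])]

mapped = local_map(expand, rows)
flat_mapped = local_flat_map(expand, rows)

print(mapped)       # a list holding one nested list per input row
print(flat_mapped)  # a flat list of (key, word) pairs
```

With map-like behaviour each input record still yields a single element (a list of pairs), while the flatMap-like version yields the pairs themselves, which is the shape the question asks for.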
pault
Ivan Vinogradov

Use a list comprehension: iterate over the second element of the tuple by index and pair the first element with each item in it:

>>> tupl = ('key', ['word1', 'word2', 'word3'])  
>>> [(tupl[0], tupl[1][i]) for i in range(len(tupl[1]))]
[('key', 'word1'), ('key', 'word2'), ('key', 'word3')]

You can apply this solution to your RDD using flatMap():

myrdd = sc.parallelize([('key', ['word1', 'word2', 'word3'])])
myrdd.flatMap(lambda tupl: [(tupl[0], tupl[1][i]) for i in range(len(tupl[1]))]).collect()
#[('key', 'word1'), ('key', 'word2'), ('key', 'word3')]
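
Since the records are already (key, value) pairs, a shorter alternative may be PySpark's RDD.flatMapValues(), which flattens each value while leaving the key untouched. The sketch below assumes PySpark is installed and a local Spark context can be created; `explode_values` and `main` are hypothetical names used only for illustration.

```python
def explode_values(pair_rdd):
    """Turn (key, [v1, v2, ...]) records into (key, v1), (key, v2), ..."""
    # flatMapValues applies the function to each value and flattens the
    # result, pairing every produced item with the record's original key.
    return pair_rdd.flatMapValues(lambda words: words)


def main():
    # Imported inside the function so the helper above can be read and
    # imported on a machine without Spark (assumption: PySpark is available
    # wherever main() is actually called).
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    myrdd = sc.parallelize([('key', ['word1', 'word2', 'word3'])])
    print(explode_values(myrdd).collect())
```

Call `main()` from a PySpark session or script. Compared with the flatMap() versions, the lambda no longer has to rebuild the (key, item) tuples by hand.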
pault
Austin