I have the following RDD data:
[(13, 'Munich@en'), (13, 'Munchen@de'), (14, 'Vienna@en'), (14, 'Wien@de'), (15, 'Paris@en')]
I want to combine the above RDD using the reduceByKey method, joining the entries into a dictionary keyed by each entry's language, so that the result is the following output:
[
(13, {'en':'Munich','de':'Munchen'}),
(14, {'en':'Vienna', 'de': 'Wien'}),
(15, {'en':'Paris', 'de':''})
]
The examples of reduceByKey I have found all use numerical operations such as addition, so I am not sure how to go about updating a dictionary in each reduce step.
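For reference, the kind of example I keep running into looks like this (a minimal sketch with made-up numbers):

rdd = sc.parallelize([(1, 2), (1, 5), (2, 3)])
rdd.reduceByKey(lambda a, b: a + b).collect()
# [(1, 7), (2, 3)]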
This is my code:
rd0 = sc.parallelize(
    [(13, 'Munich@en'), (13, 'Munchen@de'), (14, 'Vienna@en'), (14, 'Wien@de'), (15, 'Paris@en')]
)

def updateDict(x, xDict):
    xDict[x[:-3]] = x[-2:]

rd0.map(lambda x: (x[0], (x[1], {'en': '', 'de': ''}))).reduceByKey(updateDict).collect()
I am getting the following error message, but I am not sure what I am doing wrong:
return f(*args, **kwargs)
File "<ipython-input-209-16cfa907be76>", line 2, in ff
TypeError: 'tuple' object does not support item assignment
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
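From my reading of the docs, reduceByKey expects a function that takes two values of the same type and returns a single merged value, rather than mutating one of them in place. So I suspect something along these lines is closer to what I need (a rough sketch; parseEntry and mergeDicts are just helper names I made up), although I do not see how it would fill in the empty 'de' entry for keys like 15 that only have one language:

def parseEntry(s):
    # turn 'Munich@en' into {'en': 'Munich'}
    name, lang = s.rsplit('@', 1)
    return {lang: name}

def mergeDicts(d1, d2):
    # return a new combined dict instead of mutating the inputs
    merged = dict(d1)
    merged.update(d2)
    return merged

rd0.map(lambda x: (x[0], parseEntry(x[1]))).reduceByKey(mergeDicts).collect()

Is this the right way to think about it, or is there a more idiomatic approach?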