I am trying to learn `combineByKey` in PySpark, basically by recreating `groupByKey` using `combineByKey`.
My RDD holds pairs like the ones below, but each pair is stored as an object of a class (`MyClass`) in a variable `myObj`.
An example `myObj` is `obj1 = ((a, b), (A, 0, 0))`, where the key is `(a, b)` and the value is `(A, 0, 0)`.
Say my example RDD looks like:

```
Rdd = [((a, b), (A, 3, 0)), ((a, b), (B, 2, 7)), ((a, c), (C, 5, 2)), ((a, d), (D, 8, 6))]
```

I would like my final output to look like:

```
Output = [((a, b), [(A, 3, 0), (B, 2, 7)]), ((a, c), (C, 5, 2)), ((a, d), (D, 8, 6))]
```
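For concreteness, here is a minimal sketch of that data as plain tuples (keys and labels written as strings; `rdd` and the literal values are just illustrative, not my real pipeline):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Illustrative stand-in for my real data: plain (key, value) tuples
# rather than MyClass instances.
rdd = sc.parallelize([
    (("a", "b"), ("A", 3, 0)),
    (("a", "b"), ("B", 2, 7)),
    (("a", "c"), ("C", 5, 2)),
    (("a", "d"), ("D", 8, 6)),
])
```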
Following some existing examples ("`combineByKey`, pyspark" and "Apache Spark: What is the equivalent implementation of RDD.groupByKey() using RDD.aggregateByKey()?"), I tried:
```python
comb = cell_location_flat.combineByKey(
    lambda row: [row],                   # createCombiner: start a new list for a key's first value
    lambda rows, row: rows + [row],      # mergeValue: append another value within a partition
    lambda rows1, rows2: rows1 + rows2,  # mergeCombiners: concatenate lists across partitions
)
print(comb.collect())
```
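(If I understand `combineByKey` correctly, on the plain-tuple sketch above this call would return one list per key, e.g. `(('a', 'b'), [('A', 3, 0), ('B', 2, 7)])`, which is essentially the grouping I want.)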
I get the following error:

```
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2499, in pipeline_func
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2499, in pipeline_func
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 352, in func
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1861, in combineLocally
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 238, in mergeValues
    for k, v in iterator:
TypeError: 'MyClass' object is not iterable
```
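Reading the traceback, `mergeValues` seems to unpack each RDD element as a `(k, v)` pair (`for k, v in iterator:`), and a `MyClass` instance can't be unpacked that way. My guess is that the objects need to be mapped to plain tuples first, something like this sketch (the `.key` and `.value` attribute names are hypothetical placeholders for whatever `MyClass` actually exposes):

```python
# Hypothetical: convert each MyClass object into a plain (key, value)
# tuple before combining. .key and .value are placeholder attribute
# names, not the real MyClass API.
pairs = cell_location_flat.map(lambda obj: (obj.key, obj.value))

comb = pairs.combineByKey(
    lambda row: [row],
    lambda rows, row: rows + [row],
    lambda rows1, rows2: rows1 + rows2,
)
```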
Any ideas on what I am doing wrong? Thanks for any replies!