
I am trying to learn combineByKey with PySpark, basically just recreating groupByKey using combineByKey.

My RDD holds objects like the one below, stored as pair RDDs. Each pair is an object stored in a class (MyClass) variable myObj.

Example myObj: obj1 = ((a, b), (A, 0, 0))

where the key is (a, b) and the value is (A, 0, 0).

Say my example RDD looks like:

Rdd = [((a, b),(A,3,0)), ((a, b),(B,2,7)), ((a, c),(C,5,2)), ((a, d),(D,8,6))]

I would like my final output to look like:

Output = [((a, b),[(A,3,0), (B,2,7)]),((a, c),(C,5,2)), ((a, d),(D,8,6))]

Following some existing examples ("combineByKey, pyspark" and "Apache Spark: What is the equivalent implementation of RDD.groupByKey() using RDD.aggregateByKey()?"), I wrote:

comb = cell_location_flat.combineByKey(
        lambda row: [row],                    # createCombiner: start a list for a new key
        lambda rows, row: rows + [row],       # mergeValue: append a value within a partition
        lambda rows1, rows2: rows1 + rows2,   # mergeCombiners: concatenate lists across partitions
    )

print(comb.collect())

I get the following error:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 367, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2499, in pipeline_func
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2499, in pipeline_func
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 352, in func
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1861, in combineLocally
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/shuffle.py", line 238, in mergeValues
    for k, v in iterator:
TypeError: 'MyClass' object is not iterable

Any ideas on what I am doing wrong? Thanks for any replies!
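For reference, the three lambdas above correspond to combineByKey's createCombiner, mergeValue, and mergeCombiners arguments. Their intended behavior can be sketched in plain Python (a simplified single-partition model for illustration, not Spark's actual shuffle implementation):

```python
def combine_by_key(pairs, create, merge_value, merge_combiners):
    """Single-partition sketch of combineByKey's merge logic."""
    combined = {}
    # Spark's mergeValues loop does `for k, v in iterator` over the RDD
    # elements; this unpacking is what fails if an element is not a
    # (key, value) pair (e.g. a plain MyClass instance).
    for key, value in pairs:
        if key in combined:
            combined[key] = merge_value(combined[key], value)
        else:
            combined[key] = create(value)
    return list(combined.items())

rdd = [(("a", "b"), ("A", 3, 0)),
       (("a", "b"), ("B", 2, 7)),
       (("a", "c"), ("C", 5, 2)),
       (("a", "d"), ("D", 8, 6))]

result = combine_by_key(
    rdd,
    lambda row: [row],                    # createCombiner
    lambda rows, row: rows + [row],       # mergeValue
    lambda rows1, rows2: rows1 + rows2,   # mergeCombiners
)
# result: [(('a', 'b'), [('A', 3, 0), ('B', 2, 7)]),
#          (('a', 'c'), [('C', 5, 2)]),
#          (('a', 'd'), [('D', 8, 6)])]
```

Note that because createCombiner wraps every value in a list, single-element groups come out as one-element lists (e.g. [(C, 5, 2)]) rather than the bare tuple shown in the desired Output above.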

    Never would have thought it was an issue with the class. Thanks for pointing me to that link. Worked perfectly. – sanjayr Apr 27 '19 at 22:36
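As the comment above indicates, the error comes from the class itself: Spark's mergeValues loop unpacks each RDD element as `k, v`, which only works if the element is iterable. One possible fix (a sketch; the attribute names key and value are assumptions about MyClass, not from the original question) is to make the class unpackable by defining __iter__:

```python
class MyClass:
    """Hypothetical stand-in for the question's class: wraps a key and a value."""

    def __init__(self, key, value):
        self.key = key      # e.g. ("a", "b")
        self.value = value  # e.g. ("A", 3, 0)

    def __iter__(self):
        # Yielding exactly two items lets `k, v = obj` (and Spark's
        # `for k, v in iterator`) unpack the object as a key/value pair.
        yield self.key
        yield self.value

obj = MyClass(("a", "b"), ("A", 3, 0))
k, v = obj  # no longer raises "'MyClass' object is not iterable"
```

An alternative that avoids touching the class is to map the RDD to plain tuples before combining, e.g. `rdd.map(lambda o: (o.key, o.value))` (again assuming those attribute names).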
