Pyspark creating nested field by partition

Question

I have a data frame with a format like:

id | product
-------------
1  | A
1  | B
1  | C
2  | A
3  | A
3  | C

What I want to accomplish is a 2 column data frame output where there is one row per ID with an array for every product owned by that ID. I tried some code with mapPartitions() but I get errors about not being able to infer schema. I know I have to yield something back in the map function, but I can't seem to figure it out.

Using Spark 1.6

Edit

In case anyone else has this question, I actually went with the solution here using combineByKey(): https://stackoverflow.com/a/27043562/1181412

It gave more flexibility to work the fields in a more granular way

I'm actually going to go with this one as it seems to get me a step closer. Thanks! — JohnB, Jan 12 '17 at 21:56

score 0 · Answer 1 · answered Jan 12 '17 at 21:20

A little bit clunky but works

a = sqlContext.createDataFrame(sc.parallelize([
            (1, 'A'), (1, 'B'), (1, 'C'), 
            (2, 'A'), 
            (3, 'A'), (3, 'C')]), ['id', 'product']) 

sqlContext.createDataFrame(
    a.rdd.reduceByKey(lambda x, y: x + ',' + y), 
    ['id', 'products']).show()

Pyspark creating nested field by partition

1 Answers1