
I have a data frame with a format like:

id | product
-------------
1  | A
1  | B
1  | C
2  | A
3  | A
3  | C 

What I want is a two-column data frame with one row per id, where the second column is an array of every product owned by that id. I tried some code with mapPartitions(), but I get errors about not being able to infer the schema. I know I have to yield something back from the function passed to mapPartitions(), but I can't figure out what.

I'm using Spark 1.6.

Edit

In case anyone else has this question, I ended up going with the combineByKey() solution from here: https://stackoverflow.com/a/27043562/1181412

It gave me more flexibility to work with the fields in a more granular way.
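For anyone wondering what the combineByKey() route looks like, here is a minimal sketch, not the exact code from the linked answer. The three function names and the `local_combine_by_key` helper are made up for illustration; the helper is only a plain-Python stand-in that mimics how Spark applies the functions within and across partitions, so the data flow can be checked without a cluster.

```python
# The three functions combineByKey() expects; the names are illustrative.
def create_combiner(product):
    # First product seen for an id within a partition: start a new list.
    return [product]

def merge_value(products, product):
    # Another product for the same id within the same partition.
    products.append(product)
    return products

def merge_combiners(left, right):
    # Merge the partial lists that were built on different partitions.
    return left + right

def local_combine_by_key(partitions):
    """Plain-Python stand-in for RDD.combineByKey (assumption: demo only)."""
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            if key in acc:
                acc[key] = merge_value(acc[key], value)
            else:
                acc[key] = create_combiner(value)
        partials.append(acc)
    merged = {}
    for acc in partials:
        for key, combiner in acc.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], combiner)
            else:
                merged[key] = combiner
    return sorted(merged.items())

# Two hypothetical partitions holding the (id, product) rows from above.
partitions = [[(1, 'A'), (1, 'B'), (2, 'A')],
              [(1, 'C'), (3, 'A'), (3, 'C')]]
result = local_combine_by_key(partitions)

# With Spark, the same functions would be passed along these lines
# (df is assumed to be the data frame from the question):
#   grouped = df.rdd.map(tuple).combineByKey(
#       create_combiner, merge_value, merge_combiners)
#   sqlContext.createDataFrame(grouped, ['id', 'products'])
```

Because each combiner is a Python list, the resulting `products` column should come out as an array rather than a string, and the three functions are the place to filter or reshape individual fields.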

JohnB

1 Answer


A little bit clunky, but it works (the products end up in one comma-separated string per id):

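# Build the sample data frame from the question.
a = sqlContext.createDataFrame(sc.parallelize([
            (1, 'A'), (1, 'B'), (1, 'C'),
            (2, 'A'),
            (3, 'A'), (3, 'C')]), ['id', 'product'])

# Each Row is treated as an (id, product) pair; products that share an id
# are joined into one comma-separated string.
sqlContext.createDataFrame(
    a.rdd.reduceByKey(lambda x, y: x + ',' + y),
    ['id', 'products']).show()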
TDrabas