I'm using Spark with Java and I have a DataFrame like this:
+---+------+-----+
|id |color |datas|
+---+------+-----+
|1  |blue  |data1|
|1  |red   |data2|
|1  |orange|data3|
|2  |black |data4|
|2  |      |data5|
|2  |yellow|     |
|3  |white |data7|
|3  |      |data8|
+---+------+-----+
I need to transform this DataFrame to look like this:
+---+-------------------+---------------------+
|id |color              |datas                |
+---+-------------------+---------------------+
|1  |[blue, red, orange]|[data1, data2, data3]|
|2  |[black, yellow]    |[data4, data5]       |
|3  |[white]            |[data7, data8]       |
+---+-------------------+---------------------+
I want to group the rows by the 'id' column and, for each of the other columns, merge the values from the different rows into an array. As in the expected output above, empty values should be left out of the arrays.
I'm able to do it through a UserDefinedAggregateFunction, but I can only do it one column at a time, and it takes too long to process.
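To make the expected result concrete, here is a minimal plain-Java sketch of the grouping I'm after. It does not use Spark at all, the class and method names (`GroupByIdSketch`, `groupById`) are hypothetical, and it assumes the blank cells in my sample are either null or empty strings. It only illustrates the output I want, not how to compute it efficiently on a DataFrame.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the expected result; this is NOT Spark code.
final class GroupByIdSketch {

    // Groups rows of the form {id, col1, col2, ...} by id and collects the
    // non-empty values of every other column into one list per column,
    // keeping the input row order. Returns id -> [col1 values, col2 values, ...].
    static Map<String, List<List<String>>> groupById(List<String[]> rows) {
        Map<String, List<List<String>>> grouped = new LinkedHashMap<>();
        for (String[] row : rows) {
            List<List<String>> columns = grouped.get(row[0]);
            if (columns == null) {
                // First row seen for this id: create one empty list per column.
                columns = new ArrayList<>();
                for (int i = 1; i < row.length; i++) {
                    columns.add(new ArrayList<String>());
                }
                grouped.put(row[0], columns);
            }
            for (int i = 1; i < row.length; i++) {
                String value = row[i];
                // Assumption: blank cells are null or empty strings; skip them.
                if (value != null && !value.trim().isEmpty()) {
                    columns.get(i - 1).add(value);
                }
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"1", "blue", "data1"},
            new String[] {"1", "red", "data2"},
            new String[] {"1", "orange", "data3"},
            new String[] {"2", "black", "data4"},
            new String[] {"2", "", "data5"},
            new String[] {"2", "yellow", ""},
            new String[] {"3", "white", "data7"},
            new String[] {"3", "", "data8"});

        // Print one line per id: id, color list, datas list.
        for (Map.Entry<String, List<List<String>>> e : groupById(rows).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue().get(0) + " " + e.getValue().get(1));
        }
    }
}
```

The point is that all columns are collected in a single pass over the rows, which is what I can't manage with my current one-column-at-a-time aggregation.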
Thank you
Edit: I'm using Spark 1.6.