3

I'm using Java Spark and I have 1 Dataframe like this

+---+-----+------+
|id |color|datas |
+----------------+
|1  |blue  |data1|
|1  |red   |data2|
|1  |orange|data3|
|2  |black |data4|
|2  |      |data5|
|2  |yellow|     |
|3  |white |data7|
|3  |      |data8|
+----------------+

I need to modify this dataframe to look like this :

+---+--------------------+---------------------+
|id |color               |datas                |
+----------------------------------------------+
|1  |[blue, red, orange] |[data1, data2, data3]|
|2  |[black, yellow]     |[data4, data5]       |
|3  |[white]             |[data7, data8]       |
+----------------------------------------------+

I want to merge the data to create an 'array' of the same column but from differents rows based on the 'id' column.

I'm able to do it throught UserDefinedAggregateFunction but I can only do it one column at a time and it takes too much time to process.

Thank you

Edit : I'm using Spark 1.6

zero323
  • 322,348
  • 103
  • 959
  • 935
Lucien
  • 99
  • 1
  • 7

2 Answers2

2

you can group by "id" and then use collect_list function to get the aggregated values.

dataframe.groupBy("id").agg(collect_list(struct("color")).as("color"), collect_list(struct("dates")).as("dates") )

Hope this helps

koiralo
  • 22,594
  • 6
  • 51
  • 72
  • Thanks for the fast answer but unfortunately, I got a "org.apache.spark.sql.AnalysisException: undefined function collect_list" when I try with collect_list I'm using Spark 1.6 – Lucien May 23 '17 at 12:05
  • did you import the function?? import static org.apache.spark.sql.functions.*; – koiralo May 23 '17 at 12:08
  • yes I did, I found this answer for the undefined function https://stackoverflow.com/questions/35324049/how-do-i-collect-a-list-of-strings-from-spark-dataframe-column-after-a-groupby-o I'm trying to solve it – Lucien May 23 '17 at 12:19
  • if you import function from spark sql it should work. – koiralo May 23 '17 at 12:36
0

The actual function that works for me is:

dataframe.groupBy("id").agg(collect_list("color").as("color"), collect_list("date").as("date") ) dataframe.createOrReplaceTempView("dataframe")

Then create a new query where you can use the struct()

dffinal = spark.sql(s"""SELECT struct(a.color) AS colors, struct(a.date) AS dates FROM dataframe a """)