Suppose my Spark SQL DataFrame df looks like this:
| id | v1 | v2 |
|----+----+----|
| 1 | 0 | 3 |
| 1 | 0 | 3 |
| 1 | 0 | 8 |
| 4 | 1 | 2 |
I want the output to be:
| id | v1 | list(v2) |
|----+----+--------------|
| 1 | [0] | [3,3,8] |
| 4 | [1] | [2] |
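To make the intended result concrete, here is a minimal sketch of the aggregation in plain Python. It is not Spark code. The helper name collect_by_id is made up for illustration, and I am assuming v1 should be collected as distinct values (set-like) while v2 keeps its duplicates (list-like), as in the table above.

```python
from collections import OrderedDict

# Rows of the example table as (id, v1, v2) tuples.
rows = [(1, 0, 3), (1, 0, 3), (1, 0, 8), (4, 1, 2)]

def collect_by_id(rows):
    """Group rows by id; keep distinct v1 values and every v2 value.

    Hypothetical helper, only meant to pin down the desired semantics.
    """
    groups = OrderedDict()  # preserves the order in which ids first appear
    for id_, v1, v2 in rows:
        v1s, v2s = groups.setdefault(id_, ([], []))
        if v1 not in v1s:   # set-like: skip repeated v1 values
            v1s.append(v1)
        v2s.append(v2)      # list-like: keep duplicates
    return [(id_, v1s, v2s) for id_, (v1s, v2s) in groups.items()]

for row in collect_by_id(rows):
    print(row)
# (1, [0], [3, 3, 8])
# (4, [1], [2])
```

What I am looking for is the DataFrame equivalent of this grouping.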
What is the simplest way to do this with a Spark SQL DataFrame, without Hive?
1) Apparently, with Hive support one could simply use the collect_set() and collect_list() aggregate functions. But these functions do not work in a plain Spark SQLContext.
2) Another way would be to write a UDAF, but given the amount of code needed, this seems like overkill for such a simple aggregation.
3) I could use df.rdd and then the groupBy() function. This is my last resort. I actually converted the RDD to a DataFrame to make data manipulation easier, but apparently not...
Are there any other simple ways that I missed?