I'm using Spark 2.0.0 with DataFrames. Here is my input DataFrame:
| id | year | qty |
|----|-------------|--------|
| a | 2012 | 10 |
| b | 2012 | 12 |
| c | 2013 | 5 |
| b | 2014 | 7 |
| c | 2012 | 3 |
What I want is:
| id | year_2012 | year_2013 | year_2014 |
|----|-----------|-----------|-----------|
| a | 10 | 0 | 0 |
| b | 12 | 0 | 7 |
| c | 3 | 5 | 0 |
or
| id | yearly_qty |
|----|---------------|
| a | [10, 0, 0] |
| b | [12, 0, 7] |
| c | [3, 5, 0] |
The closest solution I have found is `collect_list()`, but that function doesn't guarantee the order of the collected values. In my mind the solution should look something like:

    data.groupBy('id').agg(collect_function)

Is there a way to generate either output without filtering each id out in a loop?