I need to aggregate the values of the column articleId into an array. This needs to happen within each group that I create with a groupBy beforehand.
My table looks like this:
| customerId | articleId | articleText | ... |
|------------|-----------|-------------|-----|
| 1          | 1         | ...         | ... |
| 1          | 2         | ...         | ... |
| 2          | 1         | ...         | ... |
| 2          | 2         | ...         | ... |
| 2          | 3         | ...         | ... |
And I want to build something like this:
| customerId | articleIds |
|------------|------------|
| 1          | [1, 2]     |
| 2          | [1, 2, 3]  |
My code so far:
```java
DataFrame test = dfFiltered.groupBy("CUSTOMERID").agg(dfFiltered.col("ARTICLEID"));
```
But here I get an AnalysisException:

```
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'ARTICLEID' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
```
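From what I've read, I suspect collect_list might be what I need. This is a sketch of what I'm imagining (untested; it assumes Spark 1.6+, where collect_list exists in org.apache.spark.sql.functions, and that dfFiltered comes from a HiveContext, since collect_list is backed by the Hive UDAF in Spark 1.x):

```java
import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.collect_list;

// Sketch: collect all ARTICLEID values of each CUSTOMERID group into an array column.
DataFrame test = dfFiltered
    .groupBy("CUSTOMERID")
    .agg(collect_list(dfFiltered.col("ARTICLEID")).alias("articleIds"));
```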
Can someone help me build a correct statement?