
I need to aggregate the values of a column articleId into an array. This needs to be done within each group, which I create with groupBy beforehand.

My table looks like the following:

| customerId | articleId | articleText | ...
|    1       |     1     |   ...       | ...
|    1       |     2     |   ...       | ...
|    2       |     1     |   ...       | ...
|    2       |     2     |   ...       | ...
|    2       |     3     |   ...       | ...

And I want to build something like

| customerId |  articleIds |
|    1       |  [1, 2]     |
|    2       |  [1, 2, 3]  |    

My code so far:

DataFrame test = dfFiltered.groupBy("CUSTOMERID").agg(dfFiltered.col("ARTICLEID"));

But here I get an AnalysisException:

Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'ARTICLEID' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

Can someone help me build a correct statement?

D. Müller

2 Answers


With SQL syntax, when you group by something, every column in the select statement that is not part of the group by must be wrapped in an aggregate function. Maybe your Spark SQL code doesn't take this point into account.

There is a similar question whose solution should also work for your problem (see the sketch below): SPARK SQL replacement for mysql GROUP_CONCAT aggregate function
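
The linked answer essentially does the aggregation in SQL with collect_list. A minimal Java sketch of that approach, assuming a HiveContext named sqlContext and a temporary table name "articles" (both are my own placeholders, not from the original post):

dfFiltered.registerTempTable("articles");

// collect_list gathers the non-grouped articleId values into one array per customerId
DataFrame test = sqlContext.sql(
    "SELECT customerId, collect_list(articleId) AS articleIds "
  + "FROM articles GROUP BY customerId");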

minh-hieu.pham

This can be achieved using the collect_list function, but it's only available if you're using a HiveContext:

import org.apache.spark.sql.functions._

df.groupBy("customerId").agg(collect_list("articleId"))
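Since the question uses the Java API, here is a hedged Java equivalent of the same call, assuming Spark 1.6+ where org.apache.spark.sql.functions.collect_list is available (the articleIds alias is my own addition):

import static org.apache.spark.sql.functions.collect_list;

// group by customer and collect all article ids of each group into an array column
DataFrame test = dfFiltered
    .groupBy("CUSTOMERID")
    .agg(collect_list(dfFiltered.col("ARTICLEID")).alias("articleIds"));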
Paweł Jurczenko