
I need to aggregate the values of a column articleId into an array. This needs to be done within each group, which I create with groupBy beforehand.

My table looks like the following:

| customerId | articleId | articleText | ...
|    1       |     1     |   ...       | ...
|    1       |     2     |   ...       | ...
|    2       |     1     |   ...       | ...
|    2       |     2     |   ...       | ...
|    2       |     3     |   ...       | ...

And I want to build something like

| customerId |  articleIds |
|    1       |  [1, 2]     |
|    2       |  [1, 2, 3]  |    

My code so far:

DataFrame test = dfFiltered.groupBy("CUSTOMERID").agg(dfFiltered.col("ARTICLEID"));

But here I get an AnalysisException:

Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'ARTICLEID' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

Can someone help me build a correct statement?

D. Müller

2 Answers


With SQL syntax, when you group by something, every column in the select statement that is not part of the group by must be wrapped in an aggregate function. Maybe your Spark SQL code doesn't take this point into account.

There is a similar question whose solution should also work for your problem (see the sketch below): SPARK SQL replacement for mysql GROUP_CONCAT aggregate function
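
The linked answer essentially does the aggregation in SQL with collect_list. A minimal Java sketch of that approach, assuming a HiveContext named sqlContext and a temporary table name "articles" (both are my own placeholders, not from the original post):

dfFiltered.registerTempTable("articles");

// collect_list gathers the non-grouped articleId values into one array per customerId
DataFrame test = sqlContext.sql(
    "SELECT customerId, collect_list(articleId) AS articleIds "
  + "FROM articles GROUP BY customerId");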

minh-hieu.pham

This can be achieved using the collect_list function, but it's only available if you're using a HiveContext:

import org.apache.spark.sql.functions._

df.groupBy("customerId").agg(collect_list("articleId"))
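Since the question uses the Java API, here is a hedged Java equivalent of the same call, assuming Spark 1.6+ where org.apache.spark.sql.functions.collect_list is available (the articleIds alias is my own addition):

import static org.apache.spark.sql.functions.collect_list;

// group by customer and collect all article ids of each group into an array column
DataFrame test = dfFiltered
    .groupBy("CUSTOMERID")
    .agg(collect_list(dfFiltered.col("ARTICLEID")).alias("articleIds"));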
Paweł Jurczenko