Table:
id | val | category
---+-----+----------
a1 |  10 | A
a1 |  30 | B
a1 |  20 | C
a2 |   5 | A
a2 |   7 | B
a2 |   2 | C
a3 |  50 | C
a3 |  60 | B
a3 |  90 | A
Query:
SELECT max(val), id, category
FROM table
GROUP BY id;
I expected this query to work on relational databases like MySQL, Oracle, MS SQL Server, etc., so why does it not work on Spark?
Am I right to say "Spark has some limitations when using GROUP BY"? I tested the same table design on MySQL and it works fine, but on Spark the same query gives me this error:
org.apache.spark.sql.AnalysisException: expression 'category' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
Looking into this error, Spark suggests wrapping the column in first() or first_value() as a workaround. I tried that, but I am not sure the output is right.
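For reference, the workaround the error message points to would look roughly like this (just a sketch; note that first() returns an arbitrary category per id, not necessarily the category from the row holding the maximum val):

SELECT id, max(val), first(category)
FROM table
GROUP BY id;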
- Is it because Spark is non-relational?
- Does this specific use of GROUP BY also fail on other non-relational DBs?
- Is there a workaround or alternative approach? (See the sketch after this list.)
- I read somewhere that "Spark versions above 2.0 will not have such problems". I am on Spark 1.6; is it true that Spark 2.0 won't have this issue?
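As an illustration of the "alternative approach" I am asking about: a self-join against the grouped maxima should keep the category from the max(val) row per id (a sketch only, not verified on Spark 1.6; ties on val would produce multiple rows per id):

SELECT t.id, t.val, t.category
FROM table t
JOIN (SELECT id, max(val) AS max_val
      FROM table
      GROUP BY id) m
  ON t.id = m.id AND t.val = m.max_val;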