I'm using the SQL API for Spark 3.0 in a Databricks 7.0 runtime cluster. I know that I can do the following:
select
  coalesce(a, "All A") as colA,
  coalesce(b, "All B") as colB,
  sum(c) as sumC
from
  myTable
group by rollup (
  colA,
  colB
)
order by
  colA asc,
  colB asc
I'd then expect an output like:
+-------+-------+------+
| colA  | colB  | sumC |
+-------+-------+------+
| All A | All B |  300 |
| a1    | All B |  100 |
| a1    | b1    |   30 |
| a1    | b2    |   70 |
| a2    | All B |  200 |
| a2    | b1    |   50 |
| a2    | b2    |  150 |
+-------+-------+------+
However, I'm trying to write a query where only column b needs to be rolled up. I've written something like:
select
  a as colA,
  coalesce(b, "All B") as colB,
  sum(c) as sumC
from
  myTable
group by
  a,
  rollup (b)
order by
  colA asc,
  colB asc
And I'd expect an output like:
+-------+-------+------+
| colA  | colB  | sumC |
+-------+-------+------+
| a1    | All B |  100 |
| a1    | b1    |   30 |
| a1    | b2    |   70 |
| a2    | All B |  200 |
| a2    | b1    |   50 |
| a2    | b2    |  150 |
+-------+-------+------+
I know this sort of partial rollup is supported in at least some SQL dialects, but running the query above gives:

Error in SQL statement: UnsupportedOperationException

Does anyone know whether this is simply not yet supported in Spark 3.0, or whether I just have the syntax wrong? The docs aren't helpful on the subject.
I know that I can accomplish this with union all (sketched below), but I'd prefer to avoid that route, if only for the sake of elegance and brevity.
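For reference, the union all version would look something like the sketch below, using the same hypothetical myTable with columns a, b, and c as in the examples above: one branch for the per-(a, b) groups and one branch for the per-a subtotals.

-- union all workaround: the per-(a, b) groups, plus the per-a subtotals
select
  a as colA,
  b as colB,
  sum(c) as sumC
from
  myTable
group by
  a,
  b

union all

select
  a as colA,
  "All B" as colB,
  sum(c) as sumC
from
  myTable
group by
  a

order by
  colA asc,
  colB asc

That produces the same rows, but it reads myTable twice and duplicates the aggregation logic, which is exactly the repetition I'd like to avoid.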
Thanks in advance, and please let me know if I can clarify anything.