4

I'm using the the SQL API for Spark 3.0 in a Databricks 7.0 runtime cluster. I know that I can do the following:

select
  coalesce(a, "All A") as colA,
  coalesce(b, "All B") as colB,
  sum(c) as sumC
from
  myTable
group by rollup (
  colA,
  colB
)
order by
  colA asc,
  colB asc

I'd then expect an output like:

+-------+-------+------+
| colA  | colB  | sumC |
+-------+-------+------+
| All A | All B |  300 |
| a1    | All B |  100 |
| a1    | b1    |  30  |
| a1    | b2    |  70  |
| a2    | All B |  200 |
| a2    | b1    |  50  |
| a2    | b2    |  150 |
+-------+-------+------+

However, I'm trying to write a query where only column b needs to be rolled up. I've written something like:

select
  a as colA,
  coalesce(b, "All B") as colB,
  sum(c) as sumC
from
  myTable
group by 
  a,
  rollup (b)
order by
  colA asc,
  colB asc

And I'd expect an output like:

+-------+-------+------+
| colA  | colB  | sumC |
+-------+-------+------+
| a1    | All B |  100 |
| a1    | b1    |  30  |
| a1    | b2    |  70  |
| a2    | All B |  200 |
| a2    | b1    |  50  |
| a2    | b2    |  150 |
+-------+-------+------+

I know this sort of operation is supported in at least some SQL APIs, but I get Error in SQL statement: UnsupportedOperationException when trying to run the above query. Does anyone know whether this behavior is simply as-of-yet unsupported in Spark 3.0 or if I just have the syntax wrong? The docs aren't helpful on the subject.

I know that I can accomplish this with union all, but I'd prefer to avoid that route, if only for the sake of elegance and brevity.

Thanks in advance, and please let me know if I can clarify anything.

Keith
  • 65
  • 7
  • 1
    You may be able to do it with `GROUPING SETS` https://stackoverflow.com/questions/40923680/how-to-grouping-sets-as-operator-method-on-dataset. Available in SQL only. – mazaneicha Jul 01 '20 at 22:36

1 Answers1

4

Try this GROUPING SETS option:

%sql
SELECT 
  COALESCE( a, 'all a' ) a, 
  COALESCE( b, 'all b' ) b, 
  SUM(c) c
FROM myTable
GROUP BY a, b
GROUPING SETS ( ( a , b ), a )
ORDER BY a, b

My results (with updated numbers):

My results

wBob
  • 13,710
  • 3
  • 20
  • 37
  • I believe `GROUPING SETS` is indeed what I need here. Thank you very much. The syntax you've provided doesn't _quite_ achieve the desired output (it looks like you're doing something semantically equivalent to `GROUP BY ROLLUP`). I'll experiment later this week when I have time again, and once I have the right syntax, I'll follow up. From there, if you update your answer to match, I'll mark it as accepted. – Keith Jul 02 '20 at 18:41
  • 1
    Thanks Keith - I was guessing a bit as your numbers don’t quite add up but hopefully you get the idea. – wBob Jul 02 '20 at 18:44
  • 1
    Sorry about that—the numbers were just placeholders. I'll update the question to make them clearer. – Keith Jul 02 '20 at 18:53
  • Thanks for updating your answer—your output is what I'm looking for, but I don't understand why. I guess I don't yet understand what `GROUPING SETS` is doing under the hood well enough. The actual query I'm working on has five columns in the `GROUP BY`, but only the fourth column needs the additional aggregation, so I'm trying to figure out how to apply the abstracted version in this question to the actual query. – Keith Jul 03 '20 at 00:43
  • 1
    How are you getting on? `GROUPING SETS` is standard ANSI SQL so you should be able to read about it and how it works. The way I think of it is, grouping sets can add extra summary rows to your result and you control what those rows are. Two brackets () is shorthand for "summarise everything"; there's some other notes [here](https://stackoverflow.com/questions/25274879/when-to-use-grouping-sets-cube-and-rollup). – wBob Jul 05 '20 at 11:18
  • 1
    I've had the opportunity to use `GROUPING SETS` in a few other contexts now, and it's beginning to make a lot more sense with practice. Thank you again for pointing me in the right direction! – Keith Jul 29 '20 at 19:18