I have a Spark DataFrame with numerous columns:
// assuming a SparkSession named spark is in scope, as in spark-shell
import spark.implicits._
import org.apache.spark.sql.functions.sum

val df = Seq(
  ("a", 2, 3, 5, 3, 4, 2, 6, 7, 3),
  ("a", 1, 1, 2, 4, 5, 7, 3, 5, 2),
  ("b", 5, 7, 3, 6, 8, 8, 9, 4, 2),
  ("b", 2, 2, 3, 5, 6, 3, 2, 4, 8),
  ("b", 2, 5, 5, 4, 3, 6, 7, 8, 8),
  ("c", 1, 2, 3, 4, 5, 6, 7, 8, 9)
).toDF("id", "p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9")
Now I'd like to do a groupBy over id and get the sum of each p-column for each id.
Currently I'm doing the following:
val dfg =
  df.groupBy("id")
    .agg(
      sum($"p1").alias("p1"),
      sum($"p2").alias("p2"),
      sum($"p3").alias("p3"),
      sum($"p4").alias("p4"),
      sum($"p5").alias("p5"),
      sum($"p6").alias("p6"),
      sum($"p7").alias("p7"),
      sum($"p8").alias("p8"),
      sum($"p9").alias("p9")
    )
Which produces the (correct) output:
+---+---+---+---+---+---+---+---+---+---+
| id| p1| p2| p3| p4| p5| p6| p7| p8| p9|
+---+---+---+---+---+---+---+---+---+---+
| c| 1| 2| 3| 4| 5| 6| 7| 8| 9|
| b| 9| 14| 11| 15| 17| 17| 18| 16| 18|
| a| 3| 4| 7| 7| 9| 9| 9| 12| 5|
+---+---+---+---+---+---+---+---+---+---+
The question is, in reality I have several dozen p-columns like that, and I'd like to be able to write the aggregation in a more concise way.
Based on the answers to this question, I've tried to do the following:
val pcols = List.range(1, 10)
val ops = pcols.map(k => sum(df(s"p$k")).alias(s"p$k"))
val dfg =
  df.groupBy("id")
    .agg(ops: _*) // does not compile
Unfortunately, unlike select(), agg() does not seem to accept *-parameters, and so this doesn't work, producing a compile-time "no ': _*' annotation allowed here" error.
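For reference, agg's Scala vararg overload appears to be agg(expr: Column, exprs: Column*), taking the first Column as a separate parameter, which is presumably why the single-vararg call above is rejected. A minimal sketch of the head/tail split that does compile under that assumption, reusing the pcols and ops defined above:

val dfg =
  df.groupBy("id")
    .agg(ops.head, ops.tail: _*) // first Column passed separately, rest expanded as varargs

This works, but it still reads like a workaround, so a cleaner, more idiomatic way to express the whole aggregation would be welcome.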