I have a Spark DataFrame with numerous columns:
// assuming a SparkSession named spark is in scope, as in spark-shell
import spark.implicits._
import org.apache.spark.sql.functions.sum

val df = Seq(
  ("a", 2, 3, 5, 3, 4, 2, 6, 7, 3),
  ("a", 1, 1, 2, 4, 5, 7, 3, 5, 2),
  ("b", 5, 7, 3, 6, 8, 8, 9, 4, 2),
  ("b", 2, 2, 3, 5, 6, 3, 2, 4, 8),
  ("b", 2, 5, 5, 4, 3, 6, 7, 8, 8),
  ("c", 1, 2, 3, 4, 5, 6, 7, 8, 9)
).toDF("id", "p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9")
Now I'd like to do a groupBy over id and get the sum of each p-column for each id.
Currently I'm doing the following:
val dfg =
  df.groupBy("id")
    .agg(
      sum($"p1").alias("p1"),
      sum($"p2").alias("p2"),
      sum($"p3").alias("p3"),
      sum($"p4").alias("p4"),
      sum($"p5").alias("p5"),
      sum($"p6").alias("p6"),
      sum($"p7").alias("p7"),
      sum($"p8").alias("p8"),
      sum($"p9").alias("p9")
    )
Which produces the (correct) output:
+---+---+---+---+---+---+---+---+---+---+
| id| p1| p2| p3| p4| p5| p6| p7| p8| p9|
+---+---+---+---+---+---+---+---+---+---+
| c| 1| 2| 3| 4| 5| 6| 7| 8| 9|
| b| 9| 14| 11| 15| 17| 17| 18| 16| 18|
| a| 3| 4| 7| 7| 9| 9| 9| 12| 5|
+---+---+---+---+---+---+---+---+---+---+
The question is, in reality I have several dozen p-columns like that, and I'd like to be able to write the aggregation in a more concise way.
Based on the answers to this question, I've tried to do the following:
val pcols = List.range(1, 10)
val ops = pcols.map(k => sum(df(s"p$k")).alias(s"p$k"))
val dfg =
  df.groupBy("id")
    .agg(ops: _*) // does not compile
Unfortunately, unlike select(), agg() does not seem to accept *-parameters, and so this doesn't work, producing a compile-time "no ': _*' annotation allowed here" error.
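For reference, agg's Scala vararg overload appears to be agg(expr: Column, exprs: Column*), taking the first Column as a separate parameter, which is presumably why the single-vararg call above is rejected. A minimal sketch of the head/tail split that does compile under that assumption, reusing the pcols and ops defined above:

val dfg =
  df.groupBy("id")
    .agg(ops.head, ops.tail: _*) // first Column passed separately, rest expanded as varargs

This works, but it still reads like a workaround, so a cleaner, more idiomatic way to express the whole aggregation would be welcome.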