
I have a dataframe with several numeric columns which are not fixed (they can change between executions). Say I have a Seq with the names of the numeric columns, and I would like to apply an aggregation function to each of them. I have tried the following:

println(numeric_cols)
// -> Seq[String] = List(avgTkts_P1, avgTkts_P2, avgTkts_P3, avgTkts_P4)

var sum_ops = for (c <- numeric_cols) yield org.apache.spark.sql.functions.sum(c).as(c)

var result = df.groupBy($"ID").agg( sum_ops:_* )

But it gives me the following error:

scala> var avgTktsPerPeriodo = df.groupBy("ID").agg(sum_ops:_*)
<console>:79: error: overloaded method value agg with alternatives:
  (expr: org.apache.spark.sql.Column,exprs: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame <and>
  (exprs: java.util.Map[String,String])org.apache.spark.sql.DataFrame <and>
  (exprs: scala.collection.immutable.Map[String,String])org.apache.spark.sql.DataFrame <and>
  (aggExpr: (String, String),aggExprs: (String, String)*)org.apache.spark.sql.DataFrame
 cannot be applied to (org.apache.spark.sql.Column)

Any idea if this is doable in Spark/Scala?

revy

2 Answers


If you look at one of the signatures:

(expr: org.apache.spark.sql.Column,exprs: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame

The first argument is a single Column expression and the second is a varargs of Columns, which is why passing the whole Seq in one slot doesn't match any overload.

You need to do something like:

val result = df.groupBy($"ID").agg( sum_ops.head, sum_ops.tail:_* )
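
For reference, a minimal self-contained sketch of this pattern (the data and Spark session setup here are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data shaped like the question's
val df = Seq(
  ("a", 1.0, 2.0),
  ("a", 3.0, 4.0),
  ("b", 5.0, 6.0)
).toDF("ID", "avgTkts_P1", "avgTkts_P2")

val numeric_cols = Seq("avgTkts_P1", "avgTkts_P2")
val sum_ops = for (c <- numeric_cols) yield sum(c).as(c)

// agg wants (Column, Column*), so split the Seq into head and tail
val result = df.groupBy($"ID").agg(sum_ops.head, sum_ops.tail: _*)
result.show()

Note that this assumes numeric_cols is non-empty; sum_ops.head throws on an empty Seq.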
philantrovert

OK, found a solution: the agg function in Spark also accepts a Map of column name -> aggregation operation:

var agg_ops = numeric_cols.map(c => c -> "sum").toMap

var result = df.groupBy($"ID").agg( agg_ops )
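
One caveat with the Map form: Spark names the output columns after the operation, e.g. sum(avgTkts_P1) instead of avgTkts_P1. If the original names matter downstream, a rename pass (a sketch reusing numeric_cols from the question) can restore them:

// Map-based agg yields columns named "sum(<col>)"; rename them back
for (c <- numeric_cols) {
  result = result.withColumnRenamed(s"sum($c)", c)
}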
revy