
I have a dataframe with multiple columns. I want to group by one of the columns and aggregate all the other columns at once. Let's say the table has 4 columns, cust_id, f1, f2, f3, and I want to group by cust_id and then get avg(f1), avg(f2) and avg(f3). The table will have many columns. Any hints?

The following code is a good start, but as I have many columns it may not be a good idea to write them out manually.

df.groupBy("cust_id").agg(sum("f1"), sum("f2"), sum("f3"))
HHH
  • Possible duplicate of [SparkSQL: apply aggregate functions to a list of column](http://stackoverflow.com/questions/33882894/sparksql-apply-aggregate-functions-to-a-list-of-column) – Aug 13 '16 at 03:57

1 Answer


Maybe you can try mapping over a list of the column names:

import org.apache.spark.sql.functions.avg

val groupCol = "cust_id"

// Build an avg aggregation for every column except the grouping column
val aggCols = (df.columns.toSet - groupCol).map(
  colName => avg(colName).as(colName + "_avg")
).toList

// agg's signature is (expr: Column, exprs: Column*), so split into head and tail
df.groupBy(groupCol).agg(aggCols.head, aggCols.tail: _*)

Alternatively, if needed, you can also match the schema and build the aggregations based on the type:

import org.apache.spark.sql.functions.{avg, first}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField}

// Pick an aggregation per column based on its data type
val aggCols = df.schema.collect {
  case StructField(colName, IntegerType, _, _) => avg(colName).as(colName + "_avg")
  case StructField(colName, StringType, _, _)  => first(colName).as(colName + "_first")
}
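For reference, a minimal end-to-end sketch of the first approach, assuming a local `SparkSession` named `spark` and a small hypothetical dataset matching the question's columns:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().master("local[*]").appName("agg-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: cust_id plus feature columns f1..f3
val df = Seq(
  ("c1", 1.0, 2.0, 3.0),
  ("c1", 3.0, 4.0, 5.0),
  ("c2", 10.0, 20.0, 30.0)
).toDF("cust_id", "f1", "f2", "f3")

val groupCol = "cust_id"
val aggCols = (df.columns.toSet - groupCol).map(c => avg(c).as(c + "_avg")).toList

df.groupBy(groupCol).agg(aggCols.head, aggCols.tail: _*).show()
```

Note that `df.columns.toSet` does not preserve column order; if the order of the output columns matters, `df.columns.filter(_ != groupCol)` keeps them in their original order.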
Daniel de Paula
  • how can I name the agg columns to something like f1_avg then? – HHH Aug 12 '16 at 20:33
  • @H.Z. just put a `.as()` after. In the first example: `.map(colName => avg(colName).as(colName+"_avg"))` In the second example, just put the `.as()` just after the function – Daniel de Paula Aug 12 '16 at 20:34
  • That's incredible! agg(aggCols: _*) doesn't work, but agg(aggCols.head, aggCols.tail: _*) does! Pure magic! Could you please explain the reason behind it? Thx. – J.J. Sep 19 '17 at 21:55
  • 1
    @JennyYueJin If you look at the available `agg` signatures in the [docs](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.RelationalGroupedDataset), there is not a `agg(exprs: Column*)` option, only the option with head and tail. I'm not sure why they chose this behaviour, but I believe it's to avoid calling `agg` with an empty parameter list, like `df.groupBy("col_a").agg()`, which would be possible with a single `exprs: Column*` parameter. – Daniel de Paula Sep 20 '17 at 08:41