
I want to group a dataframe on a single column and then apply an aggregate function on all columns.

For example, I have a df with 10 columns. I wish to group on the first column "1" and then apply an aggregate function 'sum' on all the remaining columns (which are all numerical).

The R equivalent of this is summarise_all. For example, in R:

df = df %>% group_by(column_one) %>% summarise_all(funs(sum))

I do not want to manually enter the column names in the aggregate command in PySpark, as the number of columns in the dataframe will be dynamic.

ALEX MATHEW

2 Answers


Here is an example in PySpark:

df = spark.createDataFrame([(1,10,20,30),(1,10,20,30),(1,10,20,30)],['id','value1','value2','value3'])

# Build a {column: "sum"} mapping for every column except the grouping column.
# Note: use != rather than `is not` -- `is` compares object identity, not string equality.
exprs = {x: "sum" for x in df.columns if x != df.columns[0]}

df.groupBy("id").agg(exprs).show()

#+---+-----------+-----------+-----------+
#| id|sum(value2)|sum(value1)|sum(value3)|
#+---+-----------+-----------+-----------+
#|  1|         60|         30|         90|
#+---+-----------+-----------+-----------+

Here df.columns[0] refers to the first column of df. You can also refer to the groupBy column by name, as below:

exprs = {x: "sum" for x in df.columns if x != 'id'}
Shantanu Sharma
  • is it possible to use the same pattern as above but replace "sum" with a user defined function? – langtang Nov 17 '22 at 02:12
  • 1
    how would you go about giving each column an alias that is the same as their original name? ex: instead of 'sum(value2)' the column would be aliased to simply 'value2' – BenjaminFranklinGates Feb 08 '23 at 17:02
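
Addressing both comments: one option (a minimal sketch, assuming the same df as in the answer above) is to build a list of Column expressions instead of a dict. Any aggregate function can be swapped in, and .alias() restores the original column names:

from pyspark.sql import functions as F

# One expression per non-grouping column; F.sum can be swapped for any other
# aggregate (F.avg, F.max, a pandas UDF, ...), and .alias() keeps the original name.
agg_exprs = [F.sum(c).alias(c) for c in df.columns if c != 'id']

df.groupBy("id").agg(*agg_exprs).show()
#+---+------+------+------+
#| id|value1|value2|value3|
#+---+------+------+------+
#|  1|    30|    60|    90|
#+---+------+------+------+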

If I were to do that in Spark Scala, I would use the column names and define the aggregation functions like this:

val df = List(("a", 1,2,3), ("a", 4,4,4)).toDF("id", "a", "b", "c")

// Note: df.columns.tail returns Array(a, b, c)
val aggs = df.columns.tail.map(_ -> "sum").toMap
//aggs: scala.collection.immutable.Map[String,String] = Map(a -> sum, b -> sum, c -> sum)

// Group by and execute aggregates:
df.groupBy($"id").agg(aggs).show
+---+------+------+------+
| id|sum(a)|sum(b)|sum(c)|
+---+------+------+------+
|  a|     5|     6|     7|
+---+------+------+------+

Another option would be to run the same sum on all the column names specified:

df.groupBy($"id").sum(df.columns.tail: _*).show()  // to python users, :_* is a scala operator used to expand a list into a vararg

NOTE: you might be interested in this documentation: https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.RelationalGroupedDataset

Michel Lemay