4

Suppose I have a Dataset/DataFrame with the following contents:

name, marks1, marks2
Alice, 10, 20
Bob, 20, 30

I want to add a new column which should hold the average of the marks1 and marks2 columns.

Expected result:

name, marks1, marks2, Result(Avg)
Alice, 10, 20, 15
Bob, 20, 30, 25

For summing or any other arithmetic operation I use df.withColumn("xyz", $"marks1" + $"marks2"), but I cannot find a similar way for the average. Please help.

Additionally, the number of columns is not fixed: sometimes it might be the average of 2 columns, sometimes 3 or even more. So I want generic code that works in all cases.

ZygD
Akshat Chaturvedi

2 Answers

11

One of the easiest and most optimized ways is to build a list of the marks columns and use it with withColumn, as follows.

pyspark

from pyspark.sql.functions import col

# the columns to average
marksColumns = [col('marks1'), col('marks2')]

# a single Column expression: (marks1 + marks2 + ...) / number of columns
averageFunc = sum(x for x in marksColumns) / len(marksColumns)

df.withColumn('Result(Avg)', averageFunc).show(truncate=False)

and you should get

+-----+------+------+-----------+
|name |marks1|marks2|Result(Avg)|
+-----+------+------+-----------+
|Alice|10    |20    |15.0       |
|Bob  |20    |30    |25.0       |
+-----+------+------+-----------+
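
Since the question also asks for something generic, the same column-expression approach extends to any number of columns. Here is a minimal sketch, assuming the marks columns can be picked out by a shared name prefix (the 'marks' prefix is an assumption, not part of the original question):

from pyspark.sql.functions import col

# pick the columns to average dynamically; the 'marks' prefix is an assumption
marksColumns = [col(c) for c in df.columns if c.startswith('marks')]

# sum of Column expressions divided by their count, exactly as above
averageFunc = sum(marksColumns) / len(marksColumns)

df.withColumn('Result(Avg)', averageFunc).show(truncate=False)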

scala-spark

The process is almost the same in Scala as in Python above.

import org.apache.spark.sql.functions.{col, lit}

// the columns to average
val marksColumns = Array(col("marks1"), col("marks2"))

// a single Column expression: (marks1 + marks2 + ...) / number of columns
val averageFunc = marksColumns.foldLeft(lit(0)){(x, y) => x + y} / marksColumns.length

df.withColumn("Result(Avg)", averageFunc).show(false)

which should give you the same output as in pyspark.

I hope the answer is helpful

Ramesh Maharjan
  • This solution is superior in performance, because it relies on vector calculations by Spark. – PyGuy Jan 22 '21 at 16:31
6

It's as easy as using user-defined functions (UDFs). By creating a specific UDF to deal with the average of many columns, you will be able to reuse it as many times as you want.

Python

In this snippet, I'm creating a UDF that takes an array of columns and calculates its average.

from pyspark.sql.functions import udf, array
from pyspark.sql.types import DoubleType

# UDF that receives an array of values and returns their mean as a Double
avg_cols = udf(lambda array: sum(array) / len(array), DoubleType())

df.withColumn("average", avg_cols(array("marks1", "marks2"))).show()

Output:

+-----+------+------+--------+
| name|marks1|marks2| average|
+-----+------+------+--------+
|Alice|    10|    20|    15.0|
|  Bob|    20|    30|    25.0|
+-----+------+------+--------+
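
Because the UDF receives an array, it can be reused with any number of columns without changing it; for example (the marks3 column below is hypothetical, and the dynamic list is just one way to build the array):

df.withColumn("average", avg_cols(array("marks1", "marks2", "marks3"))).show()

# or build the array from a list of column names
cols_to_avg = ["marks1", "marks2"]
df.withColumn("average", avg_cols(array(*cols_to_avg))).show()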

Scala

With the Scala API, the UDF receives the selected columns as a Row; you just have to wrap the columns with Spark's struct function.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import spark.implicits._
import scala.util.Try

def average = udf((row: Row) => {
  // keep only the values that can be converted to Double
  val values = row.toSeq.map(x => Try(x.toString.toDouble).toOption).filter(_.isDefined).map(_.get)
  if (values.nonEmpty) values.sum / values.length else 0.0
})

df.withColumn("average", average(struct($"marks1", $"marks2"))).show()

As you can see, I'm casting the Any values to Double with Try, so that if a value cannot be cast, no exception is thrown and the average is computed only over the values that could be converted.
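
The same defensive idea can be applied on the pyspark side as well. This is only a sketch under the same assumptions (skip values that cannot be converted, return 0.0 when nothing is left), not part of the original answer:

from pyspark.sql.functions import udf, array
from pyspark.sql.types import DoubleType

def safe_avg(values):
    # keep only the values that can be converted to float, mirroring the Try above
    nums = []
    for v in values:
        try:
            nums.append(float(v))
        except (TypeError, ValueError):
            pass
    return sum(nums) / len(nums) if nums else 0.0

safe_avg_udf = udf(safe_avg, DoubleType())

df.withColumn("average", safe_avg_udf(array("marks1", "marks2"))).show()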

And that's all :)

Álvaro Valencia
  • This code needs to import scala.util.Try and org.apache.spark.sql.Row. I tried this, and it shows a 0.0 average in all rows; it doesn't give me the actual average. Please help. – Akshat Chaturvedi Aug 01 '18 at 12:57
  • I think it's because you need to cast to Int rather than Double. I have edited my answer @AkshatChaturvedi – Álvaro Valencia Aug 01 '18 at 13:02
  • @AkshatChaturvedi I've modified the answer, casting the `Any` to `String` and then to `Double`. In this way, you will be able to calculate the average of as many columns as you want, even if the column types are different (for example, you can calculate the average of three columns whose types are `String`, `Long` and `Double`). And of course, you can reuse it whenever you want, just call the `average` function :) – Álvaro Valencia Aug 01 '18 at 14:28
  • Thank you. Both solutions working for me. – Akshat Chaturvedi Aug 02 '18 at 09:24