It's as easy as using User Defined Functions. By creating a specific UDF to deal with average of many columns, you will be able to reuse it as many times as you want.
Python
In this snippet, I'm creating a UDF that takes an array of columns, and calculates the average of it.
from pyspark.sql.functions import udf, array
from pyspark.sql.types import DoubleType
avg_cols = udf(lambda array: sum(array)/len(array), DoubleType())
df.withColumn("average", avg_cols(array("marks1", "marks2"))).show()
Output:
+-----+------+------+--------+
| name|marks1|marks2| average|
+-----+------+------+--------+
|Alice| 10| 20| 15.0|
| Bob| 20| 30| 25.0|
+-----+------+------+--------+
Scala
With the Scala API, you must process the selected columns as a Row. You just have to select the columns using the Spark struct
function.
import org.apache.spark.sql.functions._
import spark.implicits._
import scala.util.Try
def average = udf((row: Row) => {
val values = row.toSeq.map(x => Try(x.toString.toDouble).toOption).filter(_.isDefined).map(_.get)
if(values.nonEmpty) values.sum / values.length else 0.0
})
df.withColumn("average", average(struct($"marks1", $"marks2"))).show()
As you can see, I'm casting all any values to Double with Try
, so that if the value cannot be casted, it won't throw any exception, performing the average only on those columns that are defined.
And that's all :)