I have a DataFrame with a high volume of data and "n" columns.
df_avg_calc: org.apache.spark.sql.DataFrame = [col1: double, col2: double ... 4 more fields]
+------------------+-----------------+------------------+-----------------+-----+-----+
| col1| col2| col3| col4| col5| col6|
+------------------+-----------------+------------------+-----------------+-----+-----+
| null| null| null| null| null| null|
| 14.0| 5.0| 73.0| null| null| null|
| null| null| 28.25| null| null| null|
| null| null| null| null| null| null|
|33.723333333333336|59.78999999999999|39.474999999999994|82.09666666666666|101.0|53.43|
| 26.25| null| null| 2.0| null| null|
| null| null| null| null| null| null|
| 54.46| 89.475| null| null| null| null|
| null| 12.39| null| null| null| null|
| null| 58.0| 19.45| 1.0| 1.33|158.0|
+------------------+-----------------+------------------+-----------------+-----+-----+
I need to compute a row-wise average, ignoring null cells so they are not counted in the denominator.
This needs to be implemented in Spark / Scala. I've tried to illustrate the expected behaviour in the attached image.
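To make the expected behaviour concrete: in the second row above only col1, col2 and col3 are populated, so the expected average is (14.0 + 5.0 + 73.0) / 3 = 30.666..., and a row whose cells are all null should yield null rather than 0.0.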
What I have tried so far:
By referring to Calculate row mean, ignoring NAs in Spark Scala:
import org.apache.spark.sql.functions._

val cols = df_raw.schema.fieldNames.filter(f => f.contains("colname"))
val rowMeans = df_raw.select(cols.map(f => col(f)).reduce(_ + _) / lit(cols.length) as "row_mean")
Here df_raw contains the columns that need to be aggregated (row-wise, of course). There are more than 80 columns; each may arbitrarily hold data or null, and nulls must be left out of the denominator when calculating the average. This works fine when every column contains data, but a single null in any column makes the whole result null, because + propagates null.
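For reference, my understanding is that the same column-expression style can be made null-safe by coalescing nulls to 0 in the numerator and counting only the non-null cells in the denominator. A rough sketch (not verified against the full 80-column data; cols is the filtered field-name list from above):

import org.apache.spark.sql.functions._

// Numerator: treat null cells as 0 so a single null no longer nulls the sum.
val sumExpr = cols.map(c => coalesce(col(c), lit(0.0))).reduce(_ + _)
// Denominator: count only the cells that are actually populated.
val cntExpr = cols.map(c => when(col(c).isNotNull, 1).otherwise(0)).reduce(_ + _)
// when(...) without otherwise(...) yields null, so all-null rows give null
// instead of a division by zero.
val rowMeansSafe = df_raw.select(when(cntExpr > 0, sumExpr / cntExpr).as("row_mean"))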
Edit:
I have tried to adapt this answer by Terry Dactyl:
def average(l: Seq[Double]): Option[Double] = {
  val nonNull = l.flatMap(i => Option(i))
  if (nonNull.isEmpty) None else Some(nonNull.reduce(_ + _).toDouble / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[Double]))
val rowAvgDF = df_avg_calc.select(avgUdf(array($"col1",$"col2",$"col3",$"col4",$"col5",$"col6")).as("row_avg"))
rowAvgDF: org.apache.spark.sql.DataFrame = [row_avg: double]
rowAvgDF.show(10, false)
+------------------+
|row_avg |
+------------------+
|0.0 |
|15.333333333333334|
|4.708333333333333 |
|0.0 |
|61.58583333333333 |
|4.708333333333333 |
|0.0 |
|23.989166666666666|
|2.065 |
|39.63 |
+------------------+
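The averages are clearly wrong: the all-null rows come back as 0.0 instead of null, and every sum is divided by 6 regardless of how many cells are populated (e.g. 15.333... = (14.0 + 5.0 + 73.0) / 6). My suspicion is that because the parameter is typed Seq[Double], Spark unboxes the null array elements to 0.0 before average ever sees them, so Option(i) never filters anything out. A sketch of the variant I would try next, typing the elements as boxed java.lang.Double so the nulls survive into the UDF (an untested assumption on my part):

import org.apache.spark.sql.functions._
// assumes spark.implicits._ is in scope for the $-syntax, as in the snippet above

// With java.lang.Double the null elements arrive as null references
// instead of being unboxed to 0.0, so they can actually be filtered.
def averageBoxed(l: Seq[java.lang.Double]): Option[Double] = {
  val nonNull = l.filter(_ != null).map(_.doubleValue)
  if (nonNull.isEmpty) None else Some(nonNull.sum / nonNull.size)
}

val avgBoxedUdf = udf(averageBoxed _)
val rowAvgDF2 = df_avg_calc.select(
  avgBoxedUdf(array($"col1", $"col2", $"col3", $"col4", $"col5", $"col6")).as("row_avg"))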