
I'm trying to find a way to calculate the mean of each row in a Spark DataFrame in Scala, ignoring NAs. In R there is a very convenient function, rowMeans, where one can specify to ignore NAs:

rowMeans(df, na.rm = TRUE)

I'm unable to find a corresponding function for Spark DataFrames, and I wonder if anyone has a suggestion on whether this is possible. Replacing the NAs with 0 won't do, since that affects the denominator: for a row like (1, NA, 3) the mean ignoring the NA is 2.0, but after a 0 fill it becomes 4/3 ≈ 1.33.

I found a similar question here; however, my DataFrame will have hundreds of columns.

Any help and shared insights are appreciated, cheers!

– Charmander_

2 Answers


Usually such aggregate functions ignore nulls by default. Even if there are mixed columns with numeric and string types, this will drop the strings and nulls and compute the mean over the numeric values only:

import org.apache.spark.sql.functions.{col, mean}

df.select(df.columns.map(c => mean(col(c))) :_*).show
    Thank you for your input. However, I'm looking for a way to compute the mean of the rows in a dataframe. – Charmander_ Apr 03 '17 at 14:13
  • Sorry, I confused rows with columns. For rows it is also easy: first we fill nulls with 0, then compute the mean over the columns. val df_filled = df.na.fill("0"); val nrow = n; val sumDF = df_filled.withColumn("TOTAL", df_filled.columns.map(c => col(c)).reduce((c1, c2) => (c1 + c2)/nrow)); sumDF.show() – Sergio Alyoshkin Apr 04 '17 at 06:33
  • Hi, yes that will work, but then again, as I stated: if we fill NAs with zeros, this will affect the denominator, making the computed means biased (assuming that is how the mean is computed). And in my case I will have around 1500 columns that I want to sum, making it quite impossible to say which columns to sum in a reduce statement. So in conclusion, I need to sum rows and compute the mean where NAs are not taken into consideration, for a large number of columns. Simple thing, but at the same time not.. – Charmander_ Apr 04 '17 at 06:53
  • Right. I guess iteratively collecting to R and computing rowMeans on each batch will be faster than trying to find a direct way to do it in Scala. – Sergio Alyoshkin Apr 04 '17 at 07:39
  • Yes, probably. However, I'm not sure how to accomplish that, so I might need to skip this step completely in my data processing. – Charmander_ Apr 04 '17 at 07:55
  • Why not? You simply execute collect(df) and your Spark DataFrame becomes an R data.frame. – Sergio Alyoshkin Apr 04 '17 at 08:47
  • Yes, but I cannot call R functions from within the Scala code, right? Or how do you mean? I'm running a large data processing step on a remote cluster, and I will not have time to look into how to set up a hybrid solution that uses Scala and R. I'm conducting my master's thesis, so I'm quite time limited.. :( – Charmander_ Apr 04 '17 at 09:07
  • I ran your command for columns `df.select(df.columns.map(c => mean(col(c))) :_*).show`, and the column with `NaN` values came out to have a mean of `NaN` – Amazonian Feb 07 '22 at 10:22
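A note on the last comment: Spark treats null and NaN differently, and the built-in mean/avg skips nulls but propagates NaN, which is why a column containing NaN comes out as NaN. Below is a minimal sketch of that behaviour; the toy DataFrame and the column names x and y are made up for illustration, and a SparkSession named spark (as in spark-shell) is assumed.

import org.apache.spark.sql.functions.{avg, col, isnan, lit, when}
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

// "x" contains a null, "y" contains a NaN
val toy = Seq((Some(1.0), 1.0), (None, Double.NaN), (Some(3.0), 3.0)).toDF("x", "y")

// avg skips the null in "x" (mean 2.0) but propagates the NaN in "y" (mean NaN)
toy.select(avg(col("x")), avg(col("y"))).show

// mapping NaN to null first makes avg ignore it as well
toy.select(avg(when(isnan(col("y")), lit(null)).otherwise(col("y")))).show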

You can do this by first identifying which fields are numeric, and then selecting their mean for each row...

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._  // for toDF; assumes a SparkSession named spark, as in spark-shell

val df = List(("a",1,2,3.0),("b",5,6,7.0)).toDF("s1","i1","i2","i3")

// grab numeric fields
val numericFields = df.schema.fields.filter(f => f.dataType==IntegerType || f.dataType==LongType || f.dataType==FloatType || f.dataType==DoubleType || f.dataType==ShortType).map(_.name)

// compute mean
val rowMeans = df.select(numericFields.map(f => col(f)).reduce(_+_) / lit(numericFields.length) as "row_mean")

rowMeans.show
– Ben Horsburgh
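The answer above divides by the fixed number of numeric columns, so any nulls would still be counted in the denominator, which is exactly the bias the question wants to avoid. A possible extension is sketched below; it reuses numericFields and df from the code above, and the rowSum, nonNullCount and row_mean names are purely illustrative. The idea is to exclude nulls from both the sum and the count.

import org.apache.spark.sql.functions.{coalesce, col, lit, when}

val numericCols = numericFields.map(col)

// numerator: per-row sum, with nulls contributing 0
val rowSum = numericCols.map(c => coalesce(c, lit(0))).reduce(_ + _)

// denominator: per-row count of non-null values
val nonNullCount = numericCols.map(c => when(c.isNull, 0).otherwise(1)).reduce(_ + _)

// return null rather than dividing by zero when every numeric column in a row is null
val rowMeansNoNA = df.select(
  when(nonNullCount === 0, lit(null)).otherwise(rowSum / nonNullCount) as "row_mean")

rowMeansNoNA.show

If NaN values (as opposed to nulls) can occur as well, they would additionally need to be mapped to null first, e.g. with a when(isnan(...), ...) expression as sketched under the first answer.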