
I have an input Spark dataframe:

sample A B C  D
1      1 3 5  7
2      6 8 10 9
3      6 7 8  1

I need to find the max among the A, B, C, D columns, which are subject marks, and create a new dataframe with max_marks as a new column:

sample A B C  D  max_marks
  1    1 3 5  7   7
  2    6 8 10 9   10
  3    6 7 8  1   8
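For reference, the sample input above can be reproduced with something like this (assuming a local SparkSession; in spark-shell, spark and the implicits are already available):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("max-marks").master("local[*]").getOrCreate()
import spark.implicits._

// Same data as the table above
val df = Seq(
  (1, 1, 3, 5, 7),
  (2, 6, 8, 10, 9),
  (3, 6, 7, 8, 1)
).toDF("sample", "A", "B", "C", "D")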

I have tried doing this in Scala:

import org.apache.spark.sql.functions.{col, max}

val cols = df.columns.toSeq
val df1 = cols.foldLeft(df) { (acc, colName) => acc.withColumn("max_sub", max(col(colName))) }
df1.show()

I am getting this error message:

"main" org.apache.spark.sql.AnalysisException:grouping expression sequence is empty this dataframe has about 100 columns so how to iterate over this dataframe It would be helpful to iterate over the data frame as the columns where the mean has to be found out are about 10 out of 100 column dataframe with about 10000 records I am looking to dynamically pass the columns without giving the column names manually which means to loop over the columns that i choose and perform any mathematical operation


1 Answer


There are many ways to accomplish this; one of them is to map over the rows.

A simple version of that idea: map over each row, find the maximum of the mark columns, and append it to the row as a new column.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val markCols = Seq("A", "B", "C", "D") // or build this list dynamically from df.columns

// Extend the schema with the new max_marks column
val newSchema = StructType(df.schema.fields :+ StructField("max_marks", IntegerType))

// For every row, read the values of the mark columns, take their maximum,
// and append it to the end of the row
val withMax = df.rdd.map { row =>
  val maxMark = markCols.map(c => row.getAs[Int](c)).max
  Row.fromSeq(row.toSeq :+ maxMark)
}

val df1 = spark.createDataFrame(withMax, newSchema)
df1.show()
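Another of those ways, if you would rather stay with DataFrame columns: org.apache.spark.sql.functions.greatest takes any number of columns and returns the per-row maximum, so (assuming the same markCols list as above) you can pass your dynamically chosen columns straight to it:

import org.apache.spark.sql.functions.{col, greatest}

// Per-row maximum over the chosen columns, no row-level map required
val df2 = df.withColumn("max_marks", greatest(markCols.map(col): _*))
df2.show()

The same markCols list can be reused for other per-row operations, e.g. summing the columns and dividing by markCols.size for a mean.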