I’m using Scala and want to build my own DataFrame function. For example, I want to treat a column like an array , iterate through each element and make a calculation.
To start off, I’m trying to implement my own getMax method. So column x would have the values [3,8,2,5,9], and the expected output of the method would be 9.
Here is what it looks like in Scala
def getMax(inputArray: Array[Int]): Int = {
var maxValue = inputArray(0)
for (i <- 1 until inputArray.length if inputArray(i) > maxValue) {
maxValue = inputArray(i)
}
maxValue
}
This is what I have so far, and get this error
"value length is not a member of org.apache.spark.sql.column",
and I don't know how else to iterate through the column.
def getMax(col: Column): Column = {
var maxValue = col(0)
for (i <- 1 until col.length if col(i) > maxValue){
maxValue = col(i)
}
maxValue
}
Once I am able to implement my own method, I will create a column function
val value_max:org.apache.spark.sql.Column=getMax(df.col(“value”)).as(“value_max”)
And then I hope to be able to use this in a SQL statement, for example
val sample = sqlContext.sql("SELECT value_max(x) FROM table")
and the expected output would be 9, given input column [3,8,2,5,9]
I am following an answer from another thread Spark Scala - How do I iterate rows in dataframe, and add calculated values as new columns of the data frame where they create a private method for standard deviation. The calculations I will do will be more complex than this, (e.g I will be comparing each element in the column) , am I going in the correct directions or should I be looking more into User Defined Functions?