Spark Build Custom Column Function, user defined function

Question

I’m using Scala and want to build my own DataFrame function. For example, I want to treat a column like an array, iterate through each element, and make a calculation.

To start off, I’m trying to implement my own getMax method. So column x would have the values [3,8,2,5,9], and the expected output of the method would be 9.

Here is what it looks like in Scala

def getMax(inputArray: Array[Int]): Int = {
   var maxValue = inputArray(0)
   for (i <- 1 until inputArray.length if inputArray(i) > maxValue) {
     maxValue = inputArray(i)
   }
   maxValue
}

This is what I have so far, but it fails with this error:

"value length is not a member of org.apache.spark.sql.Column"

I don't know how else to iterate through the column.

def getMax(col: Column): Column = {
  var maxValue = col(0)
  for (i <- 1 until col.length if col(i) > maxValue) {
    maxValue = col(i)
  }
  maxValue
}

Once I am able to implement my own method, I will create a column function

val value_max: org.apache.spark.sql.Column = getMax(df.col("value")).as("value_max")

And then I hope to be able to use this in a SQL statement, for example

val sample = sqlContext.sql("SELECT value_max(x) FROM table")

and the expected output would be 9, given input column [3,8,2,5,9]

I am following an answer from another thread, "Spark Scala - How do I iterate rows in dataframe, and add calculated values as new columns of the data frame", where they create a private method for standard deviation. The calculations I will do are more complex than this (e.g. I will be comparing each element in the column). Am I going in the right direction, or should I be looking more into User Defined Functions?

Please show your input and output/expected dataframes. Use `show`. — Jacek Laskowski, Apr 11 '16 at 21:31
Hi @JacekLaskowski thanks for the comment, I've edited the question to make it clearer what I would like to achieve. — other15, Apr 12 '16 at 10:19

score 29 · Accepted Answer · edited Jun 06 '18 at 02:16

In a Spark DataFrame, you can't iterate through the elements of a Column using the approaches you thought of because a Column is not an iterable object.

However, to process the values of a column, you have some options and the right one depends on your task:

1) Using the existing built-in functions

Spark SQL already has plenty of useful functions for processing columns, including aggregation and transformation functions. Most of them are in the org.apache.spark.sql.functions object (documentation here). Some others (binary functions in general) are defined directly on the Column class (documentation here). If they cover your task, they are usually the best option. Note: don't forget the window functions.
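
For the getMax case in the question, the built-in max aggregate is probably enough. A minimal sketch, assuming Spark 2.0 or later, a local SparkSession, and a hypothetical single-column DataFrame holding the values from the question (the view name table_x is made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("builtin-max").getOrCreate()
import spark.implicits._

// Hypothetical input: the column x from the question, one value per row.
val df = Seq(3, 8, 2, 5, 9).toDF("x")

// DataFrame API: aggregate the whole column with the built-in max.
val viaApi = df.agg(max($"x").as("value_max")).first().getInt(0)

// SQL: the same aggregate after registering the DataFrame as a temporary view.
df.createOrReplaceTempView("table_x")
val viaSql = spark.sql("SELECT max(x) AS value_max FROM table_x").first().getInt(0)
```

Both routes return one row with one value, which is what the question expects from value_max(x).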

2) Creating a UDF

If you can't complete your task with the built-in functions, you may consider defining a UDF (User Defined Function). UDFs are useful when you can process each item of a column independently and you expect to produce a new column with the same number of rows as the original one (not an aggregated column). This approach is quite simple: first you define a plain function, then you wrap it as a UDF, then you use it. Example:

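import org.apache.spark.sql.functions.udf

// A plain Scala function that handles one value at a time
def myFunc: (String => String) = { s => s.toLowerCase }

// Wrap it so it can be applied to a Column
val myUDF = udf(myFunc)

// Apply it row by row to produce a new column
val newDF = df.withColumn("newCol", myUDF(df("oldCol")))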

For more information, here's a nice article.
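
If each row already holds an array (rather than one value per row), a UDF can iterate over that array with ordinary Scala collection code. A rough sketch under that assumption; the DataFrame, the column names id and values, and the name arrayMax are all hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("array-udf").getOrCreate()
import spark.implicits._

// Hypothetical input: each row holds a whole array in the column "values".
val df = Seq(("a", Seq(3, 8, 2, 5, 9)), ("b", Seq(1, 7, 4))).toDF("id", "values")

// An array column reaches a Scala UDF as a Seq, so plain collection code works.
// Assumes non-empty, non-null arrays; `max` throws on an empty Seq.
val arrayMax = udf((xs: Seq[Int]) => xs.max)

// One output value per input row: the row count does not change.
val withMax = df.withColumn("value_max", arrayMax($"values"))
val result = withMax.select("id", "value_max").as[(String, Int)].collect().toMap
```

This only fits when the array lives inside a single row. Finding the maximum across rows is an aggregation, which a UDF cannot do.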

3) Using a UDAF

If your task is to create aggregated data, you can define a UDAF (User Defined Aggregation Function). I don't have a lot of experience with this, but I can point you to a nice tutorial:

https://ragrawal.wordpress.com/2015/11/03/spark-custom-udaf-example/
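
As a complement to the tutorial, here is a rough sketch of a max aggregation written against the typed Aggregator API and wrapped with functions.udaf, which assumes Spark 3.0 or later (older versions used UserDefinedAggregateFunction instead). The object name IntMax and the view name table_x are hypothetical; value_max is the name used in the question:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Hypothetical aggregator: running maximum of Int values.
// The buffer starts at Int.MinValue, which is also the result for an empty input.
object IntMax extends Aggregator[Int, Int, Int] {
  def zero: Int = Int.MinValue
  def reduce(buffer: Int, value: Int): Int = math.max(buffer, value)
  def merge(b1: Int, b2: Int): Int = math.max(b1, b2)
  def finish(buffer: Int): Int = buffer
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("udaf-max").getOrCreate()
import spark.implicits._

val df = Seq(3, 8, 2, 5, 9).toDF("x")
df.createOrReplaceTempView("table_x")

// Register under the name used in the question so it can be called from SQL.
spark.udf.register("value_max", udaf(IntMax))
val result = spark.sql("SELECT value_max(x) FROM table_x").first().getInt(0)
```

The reduce step folds one input value into a partition-local buffer, and merge combines buffers from different partitions, which is why both are needed even for something as simple as a maximum.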

4) Fall back to RDD processing

If you really can't use the options above, or if processing one row depends on other rows and the task is not an aggregation, then I think you would have to select the column you want and process it using the corresponding RDD. Example:

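// df("column") returns a Column, which has no .rdd; select gives a one-column DataFrame
val singleColumnDF = df.select("column")

val myRDD = singleColumnDF.rdd // RDD[Row]

// process myRDD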
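
To make the "process myRDD" step concrete, here is a small sketch assuming a hypothetical integer column x with the values from the question. It computes the maximum with reduce, then compares every element with every other element using cartesian, which is quadratic in the number of rows and only reasonable for small inputs:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("rdd-fallback").getOrCreate()
import spark.implicits._

// Hypothetical input: the column x from the question, one value per row.
val df = Seq(3, 8, 2, 5, 9).toDF("x")

// Pull the single column out as an RDD[Int].
val values = df.select("x").rdd.map(row => row.getInt(0))

// A whole-column calculation: the maximum, via a pairwise reduce.
val maxValue = values.reduce((a, b) => math.max(a, b))

// A cross-row calculation: count ordered pairs (a, b) with a < b.
val ascendingPairs = values.cartesian(values).filter { case (a, b) => a < b }.count()
```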

So, those were the options I could think of. I hope it helps.

Thanks Daniel, very informative. So the main difference between UDF and UDAF is that a UDAF returns one value based on column calculation? I am hoping that the built in functions will be sufficient for what I want to do, but it would be good to know how to implement my own functions. — other15, May 14 '16 at 14:26
@other15 An UDAF is usually applied with `groupBy`, so it can return an aggregated value for each distinct value in the columns passed to `groupBy` (similar to how a simple `df.groupBy("key").agg(avg("value"))` works). However, if you don't use groupBy, the UDAF will return only one value. — Daniel de Paula, May 14 '16 at 14:47

score 4 · Answer 2 · answered Feb 02 '17 at 08:12

An easy example is given in the excellent documentation, where a whole section is dedicated to UDFs:

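// Assumes spark-shell, where spark.implicits._ (toDF, $"...") is already in scope
import org.apache.spark.sql._
import org.apache.spark.sql.functions.callUDF

val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val spark = df.sparkSession
// Register the function under a name, then call it by that name
spark.udf.register("simpleUDF", (v: Int) => v * v)
df.select($"id", callUDF("simpleUDF", $"value"))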

Boern


The link http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$ redirects to the http://spark.apache.org/docs/latest/api/scala/index.html#package . Couldn't you fix it? – Hryhorii Liashenko Jul 15 '19 at 11:53
