
Coming from R, I am used to easily doing operations on columns. Is there any easy way to take this function that I've written in Scala

def round_tenths_place(un_rounded: Double): Double =
  BigDecimal(un_rounded).setScale(1, BigDecimal.RoundingMode.HALF_UP).toDouble
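As a quick standalone check (the sample values here are illustrative, not from the original question), the function can be exercised on its own to confirm the HALF_UP behaviour before wiring it into a DataFrame:

```scala
def round_tenths_place(un_rounded: Double): Double =
  BigDecimal(un_rounded).setScale(1, BigDecimal.RoundingMode.HALF_UP).toDouble

// HALF_UP rounds a 5 in the hundredths place away from zero
println(round_tenths_place(2.25)) // 2.3
println(round_tenths_place(3.14)) // 3.1
```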

And apply it to one column of a dataframe - kind of what I hoped this would do:

bid_results.withColumn("bid_price_bucket", round_tenths_place(bid_results("bid_price")))

I haven't found any easy way and am struggling to figure out how to do this. There's got to be an easier way than converting the dataframe to an RDD, selecting the right field from each row, and mapping the function across all of the values, yeah? And also something more succinct than creating a SQL table and then doing this with a Spark SQL UDF?

Michael Discenza

1 Answer


You can define a UDF as follows:

import org.apache.spark.sql.functions.udf

val round_tenths_place_udf = udf(round_tenths_place _)
bid_results.withColumn(
  "bid_price_bucket", round_tenths_place_udf($"bid_price"))
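For context, here is a self-contained sketch of the UDF route; the session setup and sample data are illustrative stand-ins for the asker's `bid_results`, not taken from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfDemo {
  def round_tenths_place(un_rounded: Double): Double =
    BigDecimal(un_rounded).setScale(1, BigDecimal.RoundingMode.HALF_UP).toDouble

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("udf-demo").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the asker's bid_results dataframe
    val bid_results = Seq(2.25, 3.149, 0.96).toDF("bid_price")

    // Wrap the Scala function so it can be applied to a Column
    val round_tenths_place_udf = udf(round_tenths_place _)
    bid_results
      .withColumn("bid_price_bucket", round_tenths_place_udf($"bid_price"))
      .show()

    spark.stop()
  }
}
```

Note that `$"bid_price"` requires `import spark.implicits._` to be in scope.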

although the built-in `round` expression uses exactly the same logic as your function and should be more than enough, not to mention much more efficient:

import org.apache.spark.sql.functions.round

bid_results.withColumn("bid_price_bucket", round($"bid_price", 1))
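If the column name is held in a String variable rather than hard-coded, `col(...)` builds a `Column` from it, which addresses the follow-up question about parametrizing `$"bid_price"`. A minimal sketch, where the session setup and sample data are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, round}

object RoundDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("round-demo").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the asker's bid_results dataframe
    val bid_results = Seq(2.25, 3.149, 0.96).toDF("bid_price")

    // Column name stored in a variable, resolved with col()
    val tg_column = "bid_price"
    bid_results
      .withColumn("bid_price_bucket", round(col(tg_column), 1))
      .show()

    spark.stop()
  }
}
```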


zero323
  • How can I parametrize the $"bid_price" call? Suppose I have the column name stored in a variable, e.g. val tg_column = "bid_price", and then something like $tg_column – Nambu14 Aug 30 '19 at 20:00
  • This approach is causing `NotSerializableException` in Databricks. The only way seems to be writing an arrow function and wrapping it into `udf()`. If I reference any other function, I get this exception. – greatvovan Oct 22 '21 at 05:06