I have a question about DataFrames. I saw this code:

    import org.apache.spark.sql.functions.{lower, regexp_replace, trim}
    import spark.implicits._  // for the $"col" syntax

    df.
      withColumn("emailF", trim($"email")).
      withColumn("emailF", regexp_replace($"emailF", " +", "")).
      withColumn("emailF", lower($"emailF"))

But I said that I would prefer to use a UDF that applies all the email-formatting rules in one place, something like this:

    import org.apache.spark.sql.functions.udf

    val customUdf = udf((txt: String) => {
      txt.toLowerCase.trim.replaceAll(" +", "")
      // ...other logic
    })
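
The UDF is then applied with a single `withColumn` call:

    df.withColumn("emailF", customUdf($"email"))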

My question is: which is better, chaining multiple `withColumn` calls on the same column to apply the functions one by one, or using a single function that applies all the rules at once?
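
For comparison, the same rules can also be combined into a single `withColumn` using only built-in functions, with no UDF at all; a minimal sketch equivalent to the chain above:

    df.withColumn("emailF", lower(regexp_replace(trim($"email"), " +", "")))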

Thanks for your answers and suggestions.

  • The only way to be sure is to test by yourself. – Gaël J Sep 10 '22 at 20:13
  • I would bet it's very similar though as I guess Spark optimisations "merge" the 3 `withColumn`. – Gaël J Sep 10 '22 at 20:14
  • 1
    Some helpful links in this context: [Spark functions vs UDF performance](https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance), [Introducing Pandas UDF for PySpark](https://www.databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html), [Working with UDFs in Apache Spark](https://blog.cloudera.com/working-with-udfs-in-apache-spark/) – Azhar Khan Sep 11 '22 at 09:45
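
Following up on the suggestion to test it yourself: one way to check whether Spark really merges the three `withColumn` calls is to compare the optimized plans with `explain`. A minimal sketch, assuming a `SparkSession` named `spark` and a `df` with an `email` column:

    import org.apache.spark.sql.functions.{lower, regexp_replace, trim, udf}
    import spark.implicits._

    // Variant 1: chained built-in functions
    val chained = df.
      withColumn("emailF", trim($"email")).
      withColumn("emailF", regexp_replace($"emailF", " +", "")).
      withColumn("emailF", lower($"emailF"))

    // Variant 2: one UDF applying all rules
    val customUdf = udf((txt: String) => txt.toLowerCase.trim.replaceAll(" +", ""))
    val withUdf = df.withColumn("emailF", customUdf($"email"))

    // Catalyst typically collapses the three projections into a single one,
    // while the UDF shows up as an opaque call the optimizer cannot look inside.
    chained.explain(true)
    withUdf.explain(true)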
