I have a question about DataFrames. I saw this code:

    import org.apache.spark.sql.functions.{lower, regexp_replace, trim}
    import spark.implicits._  // for the $"col" syntax

    df.
      withColumn("emailF", trim($"email")).
      withColumn("emailF", regexp_replace($"emailF", " +", "")).
      withColumn("emailF", lower($"emailF"))

But I said that I would prefer to use a UDF that applies all the email-formatting rules in one place, something like this:

    import org.apache.spark.sql.functions.udf

    val customUdf = udf((txt: String) => {
      txt.toLowerCase.trim.replaceAll(" +", "")
      // ...other logic
    })
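
The UDF is then applied with a single `withColumn` call:

    df.withColumn("emailF", customUdf($"email"))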

My question is: which is better, chaining multiple `withColumn` calls on the same column to apply the functions one by one, or using a single function that applies all the rules at once?
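
For comparison, the same rules can also be combined into a single `withColumn` using only built-in functions, with no UDF at all; a minimal sketch equivalent to the chain above:

    df.withColumn("emailF", lower(regexp_replace(trim($"email"), " +", "")))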

Thanks for your answers and suggestions.

  • The only way to be sure is to test by yourself. – Gaël J Sep 10 '22 at 20:13
  • I would bet it's very similar though as I guess Spark optimisations "merge" the 3 `withColumn`. – Gaël J Sep 10 '22 at 20:14
  • 1
    Some helpful links in this context: [Spark functions vs UDF performance](https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance), [Introducing Pandas UDF for PySpark](https://www.databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html), [Working with UDFs in Apache Spark](https://blog.cloudera.com/working-with-udfs-in-apache-spark/) – Azhar Khan Sep 11 '22 at 09:45
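
Following up on the suggestion to test it yourself: one way to check whether Spark really merges the three `withColumn` calls is to compare the optimized plans with `explain`. A minimal sketch, assuming a `SparkSession` named `spark` and a `df` with an `email` column:

    import org.apache.spark.sql.functions.{lower, regexp_replace, trim, udf}
    import spark.implicits._

    // Variant 1: chained built-in functions
    val chained = df.
      withColumn("emailF", trim($"email")).
      withColumn("emailF", regexp_replace($"emailF", " +", "")).
      withColumn("emailF", lower($"emailF"))

    // Variant 2: one UDF applying all rules
    val customUdf = udf((txt: String) => txt.toLowerCase.trim.replaceAll(" +", ""))
    val withUdf = df.withColumn("emailF", customUdf($"email"))

    // Catalyst typically collapses the three projections into a single one,
    // while the UDF shows up as an opaque call the optimizer cannot look inside.
    chained.explain(true)
    withUdf.explain(true)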
