
I've been expanding my knowledge of Spark over the last few weeks with all the testing I've been doing for work, and I'm a bit confused about when it's appropriate to use a UDF and when it's not. Looking over some of my peers' code, they use a lot of UDFs with DataFrames, but UDFs are so resource intensive. As I've refactored their code, I've been rewriting a lot of it with spark.sql() and built-in functions only, and it's much faster. With that being said, when is it appropriate to use a UDF versus just using Spark's built-in functionality?
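
For example, a typical refactor looks something like this (the column name and the UDF here are just made up to show the pattern):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# What I keep finding: a Python UDF doing something a built-in already covers
to_upper = F.udf(lambda s: s.upper() if s is not None else None, StringType())
df.withColumn("name_upper", to_upper("name")).show()

# What I've been rewriting it to: spark.sql() / built-in functions only
df.createOrReplaceTempView("people")
spark.sql("SELECT name, upper(name) AS name_upper FROM people").show()
```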

BloodKid01

1 Answer


It is quite simple: rely as much as possible on Spark's built-in functions and only write a UDF when your transformation cannot be expressed with them.

UDFs cannot be optimized by Spark's Catalyst optimizer, so there is always a potential decrease in performance. UDFs are also expensive because they force data to be deserialized into objects in the JVM, and in PySpark every row additionally has to be shipped to a Python worker process and back.
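
You can see the difference directly in the physical plan. As a rough sketch (the plus_one example is purely illustrative): the Python UDF shows up as a BatchEvalPython step that Catalyst cannot look inside, while the equivalent built-in expression stays in the JVM and benefits from optimization and code generation.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).withColumnRenamed("id", "value")

# Python UDF: a black box for Catalyst, every row crosses the JVM/Python boundary
plus_one_udf = F.udf(lambda x: x + 1, LongType())
df.select(plus_one_udf("value")).explain()   # plan contains a BatchEvalPython step

# Built-in expression: fully visible to Catalyst, optimized and code-generated
df.select(F.col("value") + 1).explain()      # plan is a simple Project
```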

As you have also used the tag [pyspark] and as mentioned in the comment below, it might be of interest that Pandas UDFs (also known as vectorized UDFs) avoid this per-row data movement between the JVM and Python. Instead, they use Apache Arrow to transfer data in batches and Pandas to process it. You can create one with pandas_udf and read more about it in the Databricks blog post Introducing Pandas UDF for PySpark, which has a dedicated section on Performance Comparison.
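
A minimal sketch of a Pandas UDF (assuming Spark 3.x with pyarrow installed; the times_two function and the column name are just for illustration):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["amount"])

# A vectorized UDF: it receives a whole pandas.Series per batch instead of
# one Python object per row; Arrow handles the JVM <-> Python transfer.
@pandas_udf(DoubleType())
def times_two(amount: pd.Series) -> pd.Series:
    return amount * 2

df.select(times_two("amount")).show()
```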

Your peers might have used many UDFs because the built-in functions they needed were not available in earlier versions of Spark; more functions are added with every release.

Michael Heil