9

I'm using Spark with Databricks and have the following code:

def replaceBlanksWithNulls(column):
    return when(col(column) != "", col(column)).otherwise(None)

Both of these next statements work:

x = rawSmallDf.withColumn("z", replaceBlanksWithNulls("z"))

and using a UDF:

replaceBlanksWithNulls_Udf = udf(replaceBlanksWithNulls)
y = rawSmallDf.withColumn("z", replaceBlanksWithNulls_Udf("z"))

It is unclear to me from the documentation when I should use one over the other, and why.

Shaido
Rodney

2 Answers

4

A UDF can essentially be any sort of function (there are exceptions, of course); it is not necessary to use Spark structures such as when, col, etc. By using a UDF the replaceBlanksWithNulls function can be written as normal Python code:

def replaceBlanksWithNulls(s):
    return s if s != "" else None

which can be used on a dataframe column after registering it:

from pyspark.sql.functions import udf

replaceBlanksWithNulls_Udf = udf(replaceBlanksWithNulls)
y = rawSmallDf.withColumn("z", replaceBlanksWithNulls_Udf("z"))

Note: The default return type of a UDF is string (StringType). If another type is required, it must be specified when registering the UDF, e.g.

from pyspark.sql.types import LongType

# assuming a plain Python function such as: def squared(s): return s * s
squared_udf = udf(squared, LongType())
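For instance (a sketch; df and its integer column x are hypothetical), the typed UDF is then applied like any other column function and the result keeps the declared type:

from pyspark.sql.functions import col

df.withColumn("x_squared", squared_udf(col("x"))).printSchema()
# x_squared is reported as long rather than string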

In this case, the column operation is not complex and there are Spark functions that can achieve the same thing (i.e. replaceBlanksWithNulls as in the question):

x = rawSmallDf.withColumn("z", when(col("z") != "", col("z")).otherwise(None))

This is preferred whenever possible since it allows Spark to optimize the query, see e.g. Spark functions vs UDF performance?
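One rough way to see the difference (a sketch, assuming the x and y dataframes from the question) is to compare the physical plans: the built-in version stays a native Catalyst expression, while the UDF version has to go through a Python worker:

x.explain()  # the CASE WHEN ... appears as a native expression in the plan
y.explain()  # the plan typically contains a BatchEvalPython / pythonUDF step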

Shaido
  • Thanks for the explanation - I am finding that writing PySpark code inline (which is more efficient) means I lose code reuse. For a simple example, let's say I want to extend the function replaceBlanksWithNulls to also replace NaN or another value with null - instead of just changing it in one place I'd have to find all of the places I used the inline code. So it may be more efficient, but I am finding it does not lend itself well to reuse... thoughts? – Rodney May 21 '19 at 11:20
  • @Rodney: I would recommend you to continue using methods such as the `replaceBlanksWithNulls` method in your question. It is not a UDF since it returns an expression that can be used on one or multiple columns. The performance of this approach is equivalent to "normal" inline code while allowing you to make extensions in a single place (see the sketch after this comment). – Shaido May 21 '19 at 11:35
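To illustrate that point (a minimal sketch, not from the original thread: the NaN handling and the select over all columns are assumptions), the expression-returning helper can be extended in one place and reused across columns without ever becoming a UDF:

from pyspark.sql.functions import col, when

def replaceBlanksWithNulls(column):
    # Treat both empty strings and the literal "NaN" as missing values;
    # this returns a Column expression, so Spark can still optimize it.
    c = col(column)
    return when((c != "") & (c != "NaN"), c).otherwise(None)

# Reuse the same helper on every column of the dataframe from the question.
cleaned = rawSmallDf.select(*[replaceBlanksWithNulls(c).alias(c) for c in rawSmallDf.columns])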
3

You can see the difference in Spark SQL (as mentioned in the documentation). For example, the following:

spark.sql("select replaceBlanksWithNulls(column_name) from dataframe")

does not work if you did not register the function replaceBlanksWithNulls as a UDF. In Spark SQL, the return type of the function must be known for execution. Hence, we need to register the custom function as a user-defined function (UDF) before it can be used in Spark SQL.
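A minimal sketch of what that registration could look like (the temp view name and the StringType return type are assumptions here, reusing rawSmallDf from the question):

from pyspark.sql.types import StringType

# Register the plain Python function for SQL use, declaring its return type.
spark.udf.register("replaceBlanksWithNulls", lambda s: s if s != "" else None, StringType())

rawSmallDf.createOrReplaceTempView("rawSmall")
spark.sql("SELECT replaceBlanksWithNulls(z) AS z FROM rawSmall")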

OmG
  • Thanks - so are you saying in PySpark there is no difference (e.g. is performance the same)? – Rodney May 09 '19 at 05:03
  • 1
    @Rodney my pleasure. This question is answered at https://stackoverflow.com/q/38296609/3768871, as mentioned in Shaido's answer. – OmG May 09 '19 at 06:06