9

I'm using Spark with Databricks and have the following code:

def replaceBlanksWithNulls(column):
    return when(col(column) != "", col(column)).otherwise(None)

Both of these next statements work:

x = rawSmallDf.withColumn("z", replaceBlanksWithNulls("z"))

and using a UDF:

replaceBlanksWithNulls_Udf = udf(replaceBlanksWithNulls)
y = rawSmallDf.withColumn("z", replaceBlanksWithNulls_Udf("z"))

It is unclear to me from the documentation when I should use one over the other, and why.

Shaido
Rodney

2 Answers

4

A UDF can essentially be any sort of function (there are exceptions, of course); it is not necessary to use Spark structures such as when, col, etc. By using a UDF the replaceBlanksWithNulls function can be written as normal Python code:

def replaceBlanksWithNulls(s):
    return s if s != "" else None

which can be used on a dataframe column after registering it:

from pyspark.sql.functions import udf

replaceBlanksWithNulls_Udf = udf(replaceBlanksWithNulls)
y = rawSmallDf.withColumn("z", replaceBlanksWithNulls_Udf("z"))

Note: The default return type of a UDF is string (StringType). If another type is required, it must be specified when registering the UDF, e.g.

from pyspark.sql.types import LongType

# assuming a plain Python function such as: def squared(s): return s * s
squared_udf = udf(squared, LongType())
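For instance (a sketch; df and its integer column x are hypothetical), the typed UDF is then applied like any other column function and the result keeps the declared type:

from pyspark.sql.functions import col

df.withColumn("x_squared", squared_udf(col("x"))).printSchema()
# x_squared is reported as long rather than string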

In this case, the column operation is not complex and there are Spark functions that can achieve the same thing (i.e. replaceBlanksWithNulls as in the question):

x = rawSmallDf.withColumn("z", when(col("z") != "", col("z")).otherwise(None))

This is preferred whenever possible since it allows Spark to optimize the query, see e.g. Spark functions vs UDF performance?
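One rough way to see the difference (a sketch, assuming the x and y dataframes from the question) is to compare the physical plans: the built-in version stays a native Catalyst expression, while the UDF version has to go through a Python worker:

x.explain()  # the CASE WHEN ... appears as a native expression in the plan
y.explain()  # the plan typically contains a BatchEvalPython / pythonUDF step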

Shaido
  • Thanks for the explanation - I am finding that writing PySpark code inline (which is more efficient) means I lose code reuse. For a simple example, let's say I want to extend the function replaceBlanksWithNulls to also replace NaN or another value with null - instead of just changing it in one place I'd have to find all of the places I used the inline code. So it may be more efficient, but I am finding it does not lend itself well to reuse... thoughts? – Rodney May 21 '19 at 11:20
  • @Rodney: I would recommend you to continue using methods such as the `replaceBlanksWithNulls` method in your question. It is not a UDF since it returns an expression that can be used on one or multiple columns. The performance of this approach is equivalent to "normal" inline code while allowing you to make extensions in a single place (see the sketch after this comment). – Shaido May 21 '19 at 11:35
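To illustrate that point (a minimal sketch, not from the original thread: the NaN handling and the select over all columns are assumptions), the expression-returning helper can be extended in one place and reused across columns without ever becoming a UDF:

from pyspark.sql.functions import col, when

def replaceBlanksWithNulls(column):
    # Treat both empty strings and the literal "NaN" as missing values;
    # this returns a Column expression, so Spark can still optimize it.
    c = col(column)
    return when((c != "") & (c != "NaN"), c).otherwise(None)

# Reuse the same helper on every column of the dataframe from the question.
cleaned = rawSmallDf.select(*[replaceBlanksWithNulls(c).alias(c) for c in rawSmallDf.columns])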
3

You can see the difference in Spark SQL (as mentioned in the documentation). For example, the following:

spark.sql("select replaceBlanksWithNulls(column_name) from dataframe")

does not work if you did not register the function replaceBlanksWithNulls as a UDF. In Spark SQL, the return type of the function must be known for execution. Hence, we need to register the custom function as a user-defined function (UDF) before it can be used in Spark SQL.
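A minimal sketch of what that registration could look like (the temp view name and the StringType return type are assumptions here, reusing rawSmallDf from the question):

from pyspark.sql.types import StringType

# Register the plain Python function for SQL use, declaring its return type.
spark.udf.register("replaceBlanksWithNulls", lambda s: s if s != "" else None, StringType())

rawSmallDf.createOrReplaceTempView("rawSmall")
spark.sql("SELECT replaceBlanksWithNulls(z) AS z FROM rawSmall")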

OmG
  • Thanks - so are you saying in PySpark there is no difference (e.g. is performance the same)? – Rodney May 09 '19 at 05:03
  • 1
    @Rodney my pleasure. This question is answered at https://stackoverflow.com/q/38296609/3768871, as mentioned in Shaido's answer. – OmG May 09 '19 at 06:06