A UDF can essentially be any sort of function (there are exceptions, of course) - it is not necessary to use Spark structures such as when, col, etc. By using a UDF the replaceBlanksWithNulls function can be written as normal Python code:
def replaceBlanksWithNulls(s):
    return s if s != "" else None
which can be used on a dataframe column after registering it:
from pyspark.sql.functions import udf

replaceBlanksWithNulls_Udf = udf(replaceBlanksWithNulls)
y = rawSmallDf.withColumn("z", replaceBlanksWithNulls_Udf("z"))
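For a self-contained run, rawSmallDf could be created like this (the name and the column "z" come from the question; the sample rows are just made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rawSmallDf = spark.createDataFrame([("foo",), ("",), ("bar",)], ["z"])  # made-up sample rows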
Note: The default return type of a UDF is string. If another type is required, it must be specified when registering the UDF, e.g.:
from pyspark.sql.types import LongType

def squared(x): return x * x  # assumed example implementation
squared_udf = udf(squared, LongType())
In this case, the column operation is not complex and there are Spark functions that can achieve the same thing (i.e. replaceBlanksWithNulls) as in the question:
from pyspark.sql.functions import when, col
x = rawSmallDf.withColumn("z", when(col("z") != "", col("z")).otherwise(None))
This is preferred whenever possible since it allows Spark to optimize the query, see e.g. Spark functions vs UDF performance?
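One quick way to see the difference (assuming the dataframes x and y from above) is to compare the physical plans:
x.explain()  # the built-in when/col version stays within Spark's optimized expressions
y.explain()  # the Python UDF version typically shows an extra BatchEvalPython step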