Consider the following code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def test(val):
    return val + 1

if __name__ == '__main__':
    ss = SparkSession.builder.appName("Test").getOrCreate()

    data = [("James", 3000),
            ("Michael", 4005),
            ("Robert", 4005)]

    schema = StructType([
        StructField("firstname", StringType(), True),
        StructField("salary", IntegerType(), True)
    ])

    df = ss.createDataFrame(data=data, schema=schema)
    df.withColumn("test", test(col("salary"))).show()
How is it possible that it correctly outputs:
+---------+------+----+
|firstname|salary|test|
+---------+------+----+
| James| 3000|3001|
| Michael| 4005|4006|
| Robert| 4005|4006|
+---------+------+----+
while the test function is NOT wrapped as a UDF via test = udf(test, FloatType()), and it's not registered using spark.udf.register()?
Why does every tutorial then ask to define and register a UDF? How is it possible that the above just works?
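For comparison, here is a minimal sketch of the UDF pattern those tutorials describe, reusing test and df from the snippet above (I've assumed IntegerType() as the return type, since val + 1 on the integer salary column stays integral):

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import IntegerType

    # Wrap the plain Python function as a UDF with an explicit return type.
    test_udf = udf(test, IntegerType())

    # The wrapped function is invoked row by row on the Python side,
    # rather than being called once with a Column object as in the
    # original snippet.
    df.withColumn("test", test_udf(col("salary"))).show()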