Consider the following code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def test(val):
    return val + 1

if __name__ == '__main__':
    ss = SparkSession.builder.appName("Test").getOrCreate()

    data = [("James", 3000),
            ("Michael", 4005),
            ("Robert", 4005)]

    schema = StructType([
        StructField("firstname", StringType(), True),
        StructField("salary", IntegerType(), True)
    ])

    df = ss.createDataFrame(data=data, schema=schema)
    df.withColumn("test", test(col("salary"))).show()
How is it possible that it correctly outputs:
+---------+------+----+
|firstname|salary|test|
+---------+------+----+
| James| 3000|3001|
| Michael| 4005|4006|
| Robert| 4005|4006|
+---------+------+----+
while the test function is NOT wrapped as a UDF via test = udf(test, FloatType()), and it's not registered using spark.udf.register()?
Why does every tutorial then ask to define and register a UDF? How is it possible that the above just works?
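For comparison, here is a minimal sketch of the UDF pattern those tutorials describe, reusing test and df from the snippet above (I've assumed IntegerType() as the return type, since val + 1 on the integer salary column stays integral):

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import IntegerType

    # Wrap the plain Python function as a UDF with an explicit return type.
    test_udf = udf(test, IntegerType())

    # The wrapped function is invoked row by row on the Python side,
    # rather than being called once with a Column object as in the
    # original snippet.
    df.withColumn("test", test_udf(col("salary"))).show()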