I have a PySpark DataFrame with many columns, but a subset of it looks like this:
| datetime | eventid | sessionid | lat | lon | filtertype |
|---|---|---|---|---|---|
| someval | someval | someval | someval | someval | someval |
| someval | someval | someval | someval | someval | someval |
I want to map a function some_func(), which uses only the columns 'lat', 'lon' and 'eventid', to return a Boolean value that gets added to the df as a separate column named 'verified'. Basically I need to retrieve the columns of interest separately inside the function and do my operations on them. I know I can use UDFs with df.withColumn(), but as far as I can tell they map over a single column, so I would first have to pack the columns of interest into one column, which makes the code a bit messy (see the sketch below).
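Roughly, the workaround I have in mind looks like this (a minimal sketch; some_func here is just a hypothetical stand-in for my real check):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# hypothetical placeholder for my real verification logic
def some_func(lat, lon, eventid):
    return lat is not None and lon is not None  # real check is more involved

# pack the columns of interest into a single struct column so the UDF
# receives one argument, then unpack the fields again inside the lambda
verify_udf = F.udf(
    lambda row: some_func(row["lat"], row["lon"], row["eventid"]),
    BooleanType(),
)

df = df.withColumn("verified", verify_udf(F.struct("lat", "lon", "eventid")))
```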
Is there a way to retrieve the column values separately inside the function and map that function across the entire DataFrame, similar to what Pandas allows with a map/lambda or df.apply()?
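For comparison, this is the kind of row-wise mapping I mean in Pandas (again with the hypothetical some_func, and pdf being a pandas DataFrame):

```python
# Pandas equivalent of what I'm after: apply a row-wise function
# that picks out the columns it needs by name
pdf["verified"] = pdf.apply(
    lambda row: some_func(row["lat"], row["lon"], row["eventid"]), axis=1
)
```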