I would like to run a UDF on a pandas-on-Spark DataFrame. I thought it would be easy, but I'm having a tough time figuring it out.
For example, consider my psdf (pandas-on-Spark DataFrame):
name p1 p2
0 AAA 1.0 1.0
1 BBB 1.0 1.0
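For reproducibility, here is a minimal sketch of how this frame can be created (assuming Spark 3.2+, where pyspark.pandas ships with PySpark):

import pyspark.pandas as ps

# Build the sample pandas-on-Spark DataFrame shown above
psdf = ps.DataFrame({
    "name": ["AAA", "BBB"],
    "p1": [1.0, 1.0],
    "p2": [1.0, 1.0],
})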
I have a simple function:

import math

def f(a: float, b: float) -> float:
    return math.sqrt(a**2 + b**2)
I expect the psdf below (sqrt(1.0**2 + 1.0**2) ≈ 1.414214):
name p1 p2 R
0 AAA 1.0 1.0 1.414214
1 BBB 1.0 1.0 1.414214
The actual function varies quite a bit; the one above is only a sample. For example, another function takes 3 arguments.
I tried the code below, but I get an error saying the compute.ops_on_diff_frames option is not set. The documentation says enabling that option is expensive, so I want to avoid it.
psdf["R"] = psdf[["p1","p2"]].apply(lambda x: f(*x), axis=1)
Note: I saw that one can convert to a regular Spark DataFrame and use withColumn (a sketch of what I mean is below), but I'm not sure whether that carries a performance penalty.
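Something like this is what I have in mind (untested sketch; f_udf is just an illustrative name, and pandas_api requires Spark 3.2+):

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

sdf = psdf.to_spark()                                       # pandas-on-Spark -> Spark DataFrame
f_udf = F.udf(f, DoubleType())                              # wrap f as a Spark UDF
sdf = sdf.withColumn("R", f_udf(F.col("p1"), F.col("p2")))  # add the computed column
psdf = sdf.pandas_api()                                     # back to pandas-on-Spark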
Any suggestions?