I have a Python function that takes two input parameters, does some calculation, and returns a value:
from scipy.stats import norm

def func(column1, column2):
    # Guard against the cases column1 == 1 or column2 == 0; return 0 for those
    if float(column1) != 1 and float(column2) != 0:
        return float(min(1, norm.cdf(norm.ppf(column1) - float(column2)) / column1))
    else:
        return 0
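For what it's worth, the plain Python function runs fine on its own; the inputs below are just arbitrary sample values I used to check it, not values from the DataFrame shown later:

# Plain-Python sanity check of func, outside Spark
print(func(0.5, 2))   # ~0.0455: min(1, norm.cdf(norm.ppf(0.5) - 2) / 0.5)
print(func(1, 3))     # 0: the first branch is skipped because float(column1) == 1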
I then converted this function to a PySpark UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

udf_func = udf(func, FloatType())
Now I want to apply this function to several columns, so I am using a for loop to iterate over them.
This is the DataFrame I am using:

a = [
    (1, 3, 4, 6, 4),
    (2, 2, 2, 4, 7),
    (3, 1, 5, 2, 2),
    (4, 4, 3, 6, 5),
]
b = ["column1", "column2", "column3", "column4", "column5"]
df_test = spark.createDataFrame(a, b)
df_test.show()

+-------+-------+-------+-------+-------+
|column1|column2|column3|column4|column5|
+-------+-------+-------+-------+-------+
|      1|      3|      4|      6|      4|
|      2|      2|      2|      4|      7|
|      3|      1|      5|      2|      2|
|      4|      4|      3|      6|      5|
+-------+-------+-------+-------+-------+
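A single application of the UDF outside the loop is probably enough to reproduce the problem; this is the kind of one-off call I mean (df_one and col1 are just illustrative names):

# One-off application of the UDF to one pair of columns, without the loop,
# to narrow down whether the error comes from the loop or from the UDF itself
df_one = df_test.withColumn('col1', udf_func('column1', 'column5'))
df_one.show()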
df = df_test
for i in range(1, 5):
    df = df.withColumn(f'col{i}', udf_func(f'column{i}', 'column5'))
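Unrolled, the loop is equivalent to the following (as far as I can tell), in case the f-strings are obscuring anything:

# The same four withColumn calls written out explicitly
df = df_test
df = df.withColumn('col1', udf_func('column1', 'column5'))
df = df.withColumn('col2', udf_func('column2', 'column5'))
df = df.withColumn('col3', udf_func('column3', 'column5'))
df = df.withColumn('col4', udf_func('column4', 'column5'))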
I expected this to work in PySpark, but I get an error every time I execute it. Please help, as I am new to PySpark.
Error:
line 4, in func
TypeError: _() takes 1 positional argument but 2 were given