
I have a Python function that takes two input parameters, does some calculation, and returns a value:

from scipy.stats import norm

def func(column1, column2):
    if float(column1) != 1 and float(column2) != 0:
        return float(min(1, norm.cdf(norm.ppf(column1) - float(column2)) / column1))
    else:
        return 0
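
Called on plain Python values it behaves the way I expect (a quick local sanity check, with example inputs I picked just for illustration; scipy has to be installed):

# quick local check of the plain function, before turning it into a UDF
print(func(0.2, 0.5))   # some value between 0 and 1
print(func(1, 3))       # first argument is 1, so the else branch returns 0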

Now I have converted this function to a PySpark UDF using:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

udf_func = udf(func, FloatType())

Now I want to use this function on multiple columns, so I am using a for loop to iterate through them.

This is the DataFrame I am using:

a = [
    (1, 3, 4, 6, 4),
    (2, 2, 2, 4, 7),
    (3, 1, 5, 2, 2),
    (4, 4, 3, 6, 5),
]
b = ["column1", "column2", "column3", "column4", "column5"]

df_test = spark.createDataFrame(a, b)
df_test.show()

+-------+-------+-------+-------+-------+
|column1|column2|column3|column4|column5|
+-------+-------+-------+-------+-------+
|      1|      3|      4|      6|      4|
|      2|      2|      2|      4|      7|
|      3|      1|      5|      2|      2|
|      4|      4|      3|      6|      5|
+-------+-------+-------+-------+-------+

for i in range(1, 5):
    df_test = df_test.withColumn(f'col{i}', udf_func(f'column{i}', 'column5'))
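
Each pass of the loop is meant to be equivalent to a single call like the one below (col1 is just the name I give the new output column):

# what a single iteration (i = 1) should expand to
df_test = df_test.withColumn('col1', udf_func('column1', 'column5'))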

I want this to work in PySpark, but I get an error every time I execute it. Please help, as I am new to PySpark.

Error : line 4, in func TypeError: _() takes 1 positional argument but 2 were given
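
For reference, here is everything collected into one script (the imports are what I believe are needed; norm comes from scipy.stats, which has to be available on the executors as well):

from scipy.stats import norm
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

spark = SparkSession.builder.getOrCreate()

def func(column1, column2):
    if float(column1) != 1 and float(column2) != 0:
        return float(min(1, norm.cdf(norm.ppf(column1) - float(column2)) / column1))
    else:
        return 0

udf_func = udf(func, FloatType())

a = [(1, 3, 4, 6, 4), (2, 2, 2, 4, 7), (3, 1, 5, 2, 2), (4, 4, 3, 6, 5)]
b = ["column1", "column2", "column3", "column4", "column5"]
df_test = spark.createDataFrame(a, b)

# apply the UDF pairing each of column1..column4 with column5
for i in range(1, 5):
    df_test = df_test.withColumn(f'col{i}', udf_func(f'column{i}', 'column5'))

df_test.show()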

  • Can you please provide some data to work on; don't post images. Please take a moment to read about how to post Spark questions: https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples – YOLO Sep 12 '20 at 07:01
  • I have added the data. Please have a look. Thank you for your help! – rakesh Sep 12 '20 at 08:10
