
I have written the function below in PySpark to take a deptid and return a DataFrame, which I want to use in Spark SQL.

```python
def get_max_salary(deptid):
    sql_salary = "select max(salary) from empoyee where depid = {}"
    df_salary = spark.sql(sql_salary.format(deptid))
    return df_salary

spark.udf.register('get_max_salary', get_max_salary)
```
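
For context, this is roughly how I intended to call the registered function from Spark SQL (the table and column names here are only illustrative):

```python
# Illustrative only: how the registered UDF would be invoked from Spark SQL.
# (The employee table and deptid column are assumptions, not my real schema.)
spark.sql("select deptid, get_max_salary(deptid) from employee").show()
```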

However, I get the error message below. I searched online but couldn't find a proper solution anywhere. Could someone please help me here?

Error Message - PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

  • Possible duplicate of [Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion](https://stackoverflow.com/questions/31508689/spark-broadcast-variables-it-appears-that-you-are-attempting-to-reference-spar) – pault Sep 26 '19 at 13:54
  • One more thing (probably not the solution): there may be a typo in your code, "select max(salary) from empoyee where "; do you mean "employee"? – RacoonOnMoon Oct 01 '19 at 12:12
  • Hi, thanks for the response, but the issue seems related to serialization in Python. – Pyspark Developer Oct 01 '19 at 14:52
  • According to the Spark docs, this "Registers a python function (including lambda function) as a UDF so it can be used in SQL statements", similar to the min or max functions. As far as I know, you cannot reference the SparkContext inside the UDF. – Abraham Oct 04 '19 at 13:44
  • I guess that's not correct. I have tried the same thing in Scala but couldn't get it to work. – Pyspark Developer Oct 05 '19 at 14:56
  • @PysparkDeveloper Were you able to find the solution? I am facing a similar issue. Any input is greatly appreciated. Version - Spark 3.0.1 – Idleguys Dec 06 '20 at 04:34
  • Since you haven't provided your pseudocode, I'm assuming your code looks similar to what I posted above. I can't recall what workaround I used a year ago, but try this: the SparkSession object (i.e. spark) should not be used inside the custom function body. Either pass the SparkSession object to the function as an argument, or use something like self.spark. If that doesn't work, create a function that just returns the Spark SQL string and execute it afterwards (a rough sketch of that idea is below). – Pyspark Developer Dec 07 '20 at 05:26
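
A minimal sketch of that last suggestion, assuming a live SparkSession named spark on the driver; the employee table, deptid column, and the value 10 are illustrative, not from the original post:

```python
# Sketch of the workaround from the comment above: keep all SparkSession
# access on the driver, and let the helper function only build the SQL string.

def max_salary_sql(deptid):
    # No spark.* calls here, so there is no SparkContext reference to pickle.
    return "select max(salary) from employee where deptid = {}".format(deptid)

# Executed on the driver, outside any UDF, so SPARK-5063 does not apply.
df_salary = spark.sql(max_salary_sql(10))
df_salary.show()
```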

0 Answers