
I need to implement a recursive function that uses an RDD: the driver sends a calculation to the cluster, and that calculation in turn sends further calculations to other workers.

It gives a PicklingError because a function that uses sc is called inside a transformation:

PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

How can I implement this algorithm? I am calling recursive(5) inside a function that uses the SparkContext sc, which is not possible in Spark, but how else can I handle this pattern of "sending further calculations out to the cluster again"?

The code has been simplified:

import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("local[*]") \
        .appName("test") \
        .getOrCreate()
sc = spark.sparkContext

def recursive(i):

    def main_calculation(j):
        string_to_return = ""
        # calculate something and concatenate it onto the string
        recursive(5)  # recursion is needed for a tree algorithm; it does
                      # terminate, but the code here is heavily simplified
        return string_to_return

    rdd = sc.parallelize([0, 1, 2, 3, 4])
    log_values = rdd.map(lambda n: main_calculation(n))
    llist = log_values.collect()

recursive(0)

The Spark version is 3.0.0.

Kavakli
  • The linked post explains why you get this error. Could you explain how this is different? You are calling `recursive(5)` in your function, which uses the SparkContext `sc` within it, and this is not possible when using Spark, as the error message you get clearly states. – blackbishop Dec 28 '21 at 21:59
  • Thank you for your attention, I am sorry for the misunderstanding @blackbishop – Kavakli Dec 28 '21 at 22:11
  • It's not possible to reference `sc` or an `rdd` inside other transformations. If you can update the question to explain the operation you need (with native Python code), then someone can help implement it in Spark. – Nithish Dec 29 '21 at 20:23
  • What about UDFs in PySpark? Can we do recursion like this example and parallelize (sending some calculations to the other cluster again)? @Nithish – Kavakli Dec 31 '21 at 02:12
  • What do you mean by send to other cluster? – Nithish Dec 31 '21 at 07:56
  • I am sorry for the wrong word, I meant workers. Can a worker send some calculations to other workers, or make other workers do the calculation via UDFs in recursion? @Nithish – Kavakli Dec 31 '21 at 21:46
  • In Spark, workers don't instruct other workers to perform a task; that is handled by the Application Master. That said, the computation can still be parallelized even if it's a recursion, by applying it in a fashion Spark can parallelize. – Nithish Dec 31 '21 at 22:19
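The last comment's suggestion can be sketched: instead of workers recursing (and needing `sc`), the driver drives the recursion level by level, and each level of the tree is expanded in parallel by a plain function that references no SparkContext. A minimal sketch, where the tree shape and the per-node function `expand` are assumptions for illustration, and plain Python stands in for the RDD call (the Spark form is shown in a comment):

```python
# Driver-side, level-by-level expansion of a tree recursion.
# Workers only ever run `expand`, which never touches `sc`;
# the driver collects one level's results and launches the next.

def expand(node):
    """Hypothetical per-node work: returns the node's children.
    Safe to ship to workers because it references no SparkContext."""
    value, depth = node
    if depth == 0:  # recursion base case: a leaf has no children
        return []
    return [(value * 2, depth - 1), (value * 2 + 1, depth - 1)]

def tree_walk(roots):
    visited = []
    frontier = roots
    while frontier:
        visited.extend(frontier)
        # In Spark, this line would become:
        #   frontier = sc.parallelize(frontier).flatMap(expand).collect()
        frontier = [child for node in frontier for child in expand(node)]
    return visited

# A binary tree of depth 2 rooted at 1 has 1 + 2 + 4 = 7 nodes.
nodes = tree_walk([(1, 2)])
print(len(nodes))  # 7
```

The key design point is that the recursion stack lives on the driver as a loop over frontiers, so each Spark job only serializes `expand`, not the context; the trade-off is one job per tree level rather than one per subtree.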

0 Answers