I need to implement a recursive function on top of an RDD: the function sends a calculation to the cluster, and that calculation in turn needs to send further calculations to other workers.
It gives a PicklingError because a function that uses sc is called from inside the transformation:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
How can I implement this algorithm? I am calling recursive(5) inside the function that is mapped over the RDD, and recursive itself uses the SparkContext sc, which Spark does not allow on workers. How can I handle this pattern of "sending further calculations to the cluster from inside a calculation"?
Here is a simplified version of the code:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .getOrCreate()
sc = spark.sparkContext

def recursive(i):
    def main_calculation(j):
        string_to_return = ""
        # calculate something and append the result to the string
        recursive(5)  # recursion is needed because of a tree algorithm; it does terminate, the code is just heavily simplified
        return string_to_return

    rdd = sc.parallelize([0, 1, 2, 3, 4])
    log_values = rdd.map(lambda n: main_calculation(n))
    llist = log_values.collect()

recursive(0)
Spark version is 3.0.0.
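The only workaround I can think of is to keep every Spark call on the driver and unroll the recursion into a driver-side loop, as in the rough sketch below. This is only an illustration of the idea: expand, MAX_DEPTH, and the level-by-level tree expansion are placeholders I invented, since my real calculation was simplified away above.

import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .getOrCreate()
sc = spark.sparkContext

MAX_DEPTH = 3  # placeholder stopping condition for the tree

def expand(node):
    # Pure worker-side function: it never touches sc or spark.
    # Placeholder logic that returns the children of one tree node.
    return [2 * node, 2 * node + 1]

frontier = [0, 1, 2, 3, 4]  # the initial inputs
for depth in range(MAX_DEPTH):
    # Each "recursive call" becomes one driver-side Spark job:
    # the driver collects one tree level and submits the next one.
    frontier = sc.parallelize(frontier).flatMap(expand).collect()

print(frontier)

With this shape the workers only ever run expand, which is plain Python and pickles fine, while all parallelize/collect calls stay on the driver, so the SPARK-5063 restriction is not violated. Is this the right direction, or is there a proper way to do nested/recursive jobs in Spark?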