I need to implement a recursive function on top of an RDD: the function sends a calculation to the cluster, and that calculation in turn needs to send further calculations to other workers.
It gives a PicklingError because a function that uses sc is called from inside the transformation:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
How can I implement this algorithm? I am calling recursive(5) inside the function that is mapped over the RDD, and recursive itself uses the SparkContext sc, which Spark does not allow on workers. How can I handle this pattern of "sending further calculations to the cluster from inside a calculation"?
Here is a simplified version of the code:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .getOrCreate()
sc = spark.sparkContext

def recursive(i):
    def main_calculation(j):
        string_to_return = ""
        # calculate something and append the result to the string
        recursive(5)  # recursion is needed because of a tree algorithm; it does terminate, the code is just heavily simplified
        return string_to_return

    rdd = sc.parallelize([0, 1, 2, 3, 4])
    log_values = rdd.map(lambda n: main_calculation(n))
    llist = log_values.collect()

recursive(0)
Spark version is 3.0.0.
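The only workaround I can think of is to keep every Spark call on the driver and unroll the recursion into a driver-side loop, as in the rough sketch below. This is only an illustration of the idea: expand, MAX_DEPTH, and the level-by-level tree expansion are placeholders I invented, since my real calculation was simplified away above.

import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .getOrCreate()
sc = spark.sparkContext

MAX_DEPTH = 3  # placeholder stopping condition for the tree

def expand(node):
    # Pure worker-side function: it never touches sc or spark.
    # Placeholder logic that returns the children of one tree node.
    return [2 * node, 2 * node + 1]

frontier = [0, 1, 2, 3, 4]  # the initial inputs
for depth in range(MAX_DEPTH):
    # Each "recursive call" becomes one driver-side Spark job:
    # the driver collects one tree level and submits the next one.
    frontier = sc.parallelize(frontier).flatMap(expand).collect()

print(frontier)

With this shape the workers only ever run expand, which is plain Python and pickles fine, while all parallelize/collect calls stay on the driver, so the SPARK-5063 restriction is not violated. Is this the right direction, or is there a proper way to do nested/recursive jobs in Spark?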