I have a Python script that iterates through a list of S3 URLs, re-partitions the Parquet files under each URL, and writes the merged file to another S3 destination. For a small list of URLs I use Python's multiprocessing.Pool to parallelize the work.
I now need to run this logic for thousands of URLs. To finish re-partitioning all of them in time, I want to use the pyspark library and deploy the work as a Spark job on a cluster. Here is the code in Spark:
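For reference, the existing multiprocessing version looks roughly like the sketch below; the bucket names, the URL list, and the pandas/pyarrow-based I/O are simplified placeholders for what the real script does:

import multiprocessing as mp
import pandas as pd  # relies on pyarrow + s3fs for s3:// Parquet I/O

SRC_PREFIX = "s3://source-bucket/data/"    # placeholder bucket names
DEST_PREFIX = "s3://dest-bucket/merged/"

def repartition_url(url_suffix):
    # Read all Parquet parts under one URL and write them back as a single merged file.
    df = pd.read_parquet(SRC_PREFIX + url_suffix + "/")
    df.to_parquet(DEST_PREFIX + url_suffix + "/part-0.parquet", index=False)

if __name__ == "__main__":
    urls = ["url1", "url2", "url3"]        # small list in the current setup
    with mp.Pool(processes=4) as pool:
        pool.map(repartition_url, urls)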
def myfunction(x):
    # the SparkSession is expected to be available here, but this is exactly
    # what the executors cannot do
    spark.read.parquet("s3_src/" + x + "/").coalesce(1).write.parquet("s3_dest/" + x + "/")

if __name__ == "__main__":
    # spark context (sc) and session (spark) are initiated and available here on the driver
    rdd_urls = sc.parallelize([url1, url2, url3, url4, ..., urlN])
    rdd_urls.map(lambda x: myfunction(x)).count()  # also tried rdd_urls.foreach(myfunction)
I tried both RDD.map() and RDD.foreach(), but the executors cannot run the code inside myfunction and the job fails with the following error:
_pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
My current understanding is that only the driver can submit Spark jobs to the cluster; the executors cannot. As a newcomer to Spark, I am trying to find out how to achieve the same objective. Any help with a code example is greatly appreciated.
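If only the driver can launch jobs, my guess is that the alternative is a plain driver-side loop or thread pool that calls the DataFrame API once per URL, something like the sketch below (the ThreadPoolExecutor usage, worker count, and path names are just my guesses), but I do not know whether this is the idiomatic Spark way or whether it will scale to thousands of URLs:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-urls").getOrCreate()

def repartition_url(url_suffix):
    # Runs on the driver; Spark itself distributes the read/write across executors.
    (spark.read.parquet("s3://src-bucket/" + url_suffix + "/")
          .coalesce(1)
          .write.mode("overwrite")
          .parquet("s3://dest-bucket/" + url_suffix + "/"))

urls = ["url1", "url2", "url3"]  # placeholder list
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(repartition_url, urls))

Is this the right pattern, or is there a more Spark-native way to process thousands of URLs in parallel?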