I have a Python script that iterates through a list of S3 URLs, re-partitions the Parquet files under each URL, and writes the merged file to another S3 destination. For a small list of URLs I use Python's multiprocessing.Pool to parallelize the work.
I now need to run this logic for thousands of URLs. To finish re-partitioning all of them in time, I want to use the pyspark library and deploy the work as a Spark job on a cluster. Here is the code in Spark:
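For reference, the existing multiprocessing version looks roughly like the sketch below; the bucket names, the URL list, and the pandas/pyarrow-based I/O are simplified placeholders for what the real script does:

import multiprocessing as mp
import pandas as pd  # relies on pyarrow + s3fs for s3:// Parquet I/O

SRC_PREFIX = "s3://source-bucket/data/"    # placeholder bucket names
DEST_PREFIX = "s3://dest-bucket/merged/"

def repartition_url(url_suffix):
    # Read all Parquet parts under one URL and write them back as a single merged file.
    df = pd.read_parquet(SRC_PREFIX + url_suffix + "/")
    df.to_parquet(DEST_PREFIX + url_suffix + "/part-0.parquet", index=False)

if __name__ == "__main__":
    urls = ["url1", "url2", "url3"]        # small list in the current setup
    with mp.Pool(processes=4) as pool:
        pool.map(repartition_url, urls)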
def myfunction(x):
    # the SparkSession is expected to be available here, but this is exactly
    # what the executors cannot do
    spark.read.parquet("s3_src/" + x + "/").coalesce(1).write.parquet("s3_dest/" + x + "/")

if __name__ == "__main__":
    # spark context (sc) and session (spark) are initiated and available here on the driver
    rdd_urls = sc.parallelize([url1, url2, url3, url4, ..., urlN])
    rdd_urls.map(lambda x: myfunction(x)).count()  # also tried rdd_urls.foreach(myfunction)
I tried both RDD.map() and RDD.foreach(), but the executors cannot run the code inside myfunction and the job fails with the following error:
_pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
My current understanding is that only the driver can submit Spark jobs to the cluster; the executors cannot. As a newcomer to Spark, I am trying to find out how to achieve the same objective. Any help with a code example is greatly appreciated.
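If only the driver can launch jobs, my guess is that the alternative is a plain driver-side loop or thread pool that calls the DataFrame API once per URL, something like the sketch below (the ThreadPoolExecutor usage, worker count, and path names are just my guesses), but I do not know whether this is the idiomatic Spark way or whether it will scale to thousands of URLs:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-urls").getOrCreate()

def repartition_url(url_suffix):
    # Runs on the driver; Spark itself distributes the read/write across executors.
    (spark.read.parquet("s3://src-bucket/" + url_suffix + "/")
          .coalesce(1)
          .write.mode("overwrite")
          .parquet("s3://dest-bucket/" + url_suffix + "/"))

urls = ["url1", "url2", "url3"]  # placeholder list
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(repartition_url, urls))

Is this the right pattern, or is there a more Spark-native way to process thousands of URLs in parallel?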