
I have a Python script that iterates through a list of URLs in S3, re-partitions the parquet files under each URL, and writes the merged file to another S3 destination. For a small list of URLs I use Python's multiprocessing.Pool to parallelize the work.
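For reference, the existing multiprocessing version looks roughly like the sketch below (the bucket names, the URL list, and the merge_one_url helper are simplified placeholders rather than my exact code; reading and writing s3:// paths with pandas assumes s3fs is installed):

# Simplified sketch of the existing multiprocessing version.
# Bucket names and the URL list are placeholders, not the real values.
from multiprocessing import Pool

import pandas as pd  # reads/writes s3:// paths via s3fs


def merge_one_url(url):
    # Read every parquet part under this URL and write a single merged file.
    df = pd.read_parquet("s3://src-bucket/" + url + "/")
    df.to_parquet("s3://dest-bucket/" + url + "/merged.parquet")


if __name__ == "__main__":
    urls = ["url1", "url2", "url3"]  # small list; thousands won't scale this way
    with Pool(processes=8) as pool:
        pool.map(merge_one_url, urls)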

I now need to run this logic for thousands of URLs. To finish re-partitioning all of the URLs in time, I want to use the PySpark library and deploy the work as a Spark job on a cluster. Here is the code in Spark:

def myfunction(x):
    # spark session is available here
    # "s3_src" and "s3_dest" are placeholder prefixes for the real bucket paths
    spark.read.parquet("s3_src" + x + "/").coalesce(1).write.parquet("s3_dest" + x + "/")

if __name__ == "__main__":
    # spark context (sc) is initiated and available here
    rdd_urls = sc.parallelize([url1, url2, url3, url4, ……, urlN])
    rdd_urls.map(lambda x: myfunction(x))

I tried both RDD.map() and RDD.foreach(), but realized that the Spark executors cannot run the code inside myfunction, and the job throws the following error:

_pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

My current understanding is that executors cannot submit Spark jobs; only the driver can do so. But as a newbie to Spark, I am trying to find out how to achieve the same objective in Spark, and any help with a code example is greatly appreciated.
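Based on that understanding, I am wondering whether the right pattern is something like the sketch below, which keeps all Spark calls on the driver and parallelizes only the job submission with a plain Python ThreadPool (the bucket paths, the URL list, and the merge_one_url helper are placeholders I made up, not code I have verified):

# Sketch of what I think the driver-only pattern might look like:
# all Spark calls stay on the driver, and a thread pool submits the
# per-URL jobs concurrently. Paths and the URL list are placeholders.
from multiprocessing.pool import ThreadPool

from pyspark.sql import SparkSession


def merge_one_url(spark, url):
    # Runs on the driver; Spark distributes the actual read/write work.
    (spark.read.parquet("s3://src-bucket/" + url + "/")
        .coalesce(1)
        .write.mode("overwrite")
        .parquet("s3://dest-bucket/" + url + "/"))


if __name__ == "__main__":
    spark = SparkSession.builder.appName("repartition-urls").getOrCreate()
    urls = ["url1", "url2", "url3"]  # would be thousands in practice

    # Threads, not processes, so every task shares the one driver-side session.
    with ThreadPool(16) as pool:
        pool.map(lambda u: merge_one_url(spark, u), urls)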

Sharif
  • Initiate the Spark session separately in the function and then call it in map(). Have a look at this: https://stackoverflow.com/questions/31508689/spark-broadcast-variables-it-appears-that-you-are-attempting-to-reference-spar – Jim Todd Mar 21 '19 at 11:53
  • Possible duplicate of [Parallelizing independent actions on the same DataFrame in Spark](https://stackoverflow.com/questions/48378078/parallelizing-independent-actions-on-the-same-dataframe-in-spark) – user10938362 Mar 21 '19 at 12:17
  • @JimTodd [That's extremely bad advice](https://stackoverflow.com/q/51153656/10938362) – user10938362 Mar 21 '19 at 12:18
  • Possible duplicate of [Apache Spark : When not to use mapPartition and foreachPartition?](https://stackoverflow.com/questions/49527047/apache-spark-when-not-to-use-mappartition-and-foreachpartition) – abiratsis Mar 21 '19 at 14:08
  • I considered this one a duplicate of the above since it seems that the code inside the map function over rdd_urls is executed on the executors. The problem is that you can't use objects such as SparkSession, DataFrames, and Datasets in the executors' code, only in the driver's code. – abiratsis Mar 21 '19 at 19:16

0 Answers