
My code is as follows:

import glob
import multiprocessing
import configparser
from itertools import repeat
import pyspark
from pyspark.sql import SparkSession

config = configparser.ConfigParser()

def processFiles(prcFile, spark: SparkSession):
    print(prcFile)
    app_id = spark.sparkContext.getConf().get('spark.app.id')
    app_name = spark.sparkContext.getConf().get('spark.app.name')
    print(app_id)
    print(app_name)

def main(configPath, args):
    config.read(configPath)
    spark: SparkSession = pyspark.sql.SparkSession.builder.appName("multiprocessing").enableHiveSupport().getOrCreate()
    mprc = multiprocessing.Pool(3)
    lst = glob.glob(config.get('DIT_setup_config', 'prcDetails') + 'prc_PrcId_[0-9].json')
    # This call raises the exception below: the session/context cannot be pickled into worker processes
    mprc.map(processFiles, zip(lst, repeat(spark.newSession())))

Now I want to pass a new Spark session (spark.newSession()) and process data accordingly, but I am getting an error that says:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Any help will be highly appreciated.

  • def __getnewargs__(self): # This method is called when attempting to pickle SparkContext, which is always an error: raise Exception("It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.") – pvy4917 Oct 24 '18 at 14:30
  • Possible duplicate of [How to run independent transformations in parallel using PySpark?](https://stackoverflow.com/questions/38048068/how-to-run-independent-transformations-in-parallel-using-pyspark) – 10465355 Oct 24 '18 at 14:36
  • @Prazy, so what is the way out? How do I pass a SparkSession/SparkContext that can be used in parallel? – sanjeev kumar Oct 25 '18 at 05:16
  • @user10465355, the issue I am facing is that I want to use spark.newSession() in parallel and perform various tasks in parallel. What do I pass to the map function? – sanjeev kumar Oct 25 '18 at 05:28
  • https://stackoverflow.com/questions/31508689/spark-broadcast-variables-it-appears-that-you-are-attempting-to-reference-spar – Harsha Biyani Oct 25 '18 at 05:33
  • @sanjeevkumar Generally, mixing multiple forms of parallelism is counterproductive; Spark is a distributed computation tool, so you generally shouldn't need to combine it with more parallelism. If you do, I'd suggest using Python multithreading, not multiprocessing. I could go into this more, but I'm not sure quite where to start because I don't know what your end goal is or what issues you're having. – Bi Rico Oct 25 '18 at 05:45
  • @HarshaB, thanks for the link. I understand that you cannot pickle SparkContext, so what shall I do here? Can you please tell me which line of code I need to fix? – sanjeev kumar Oct 25 '18 at 05:46
  • @BiRico, my goal is to launch a Spark job and then, using its SparkContext, launch multiple sessions, i.e. spark.newSession(). Each of these new sessions can then be used to run an independent module. This way the Spark job is launched only once and the same context is reused until the entire processing is complete. – sanjeev kumar Oct 25 '18 at 05:51
  • 1
    Take a look at the multithreading package, I think that's what you want. You can't share a single SparkContext between processes, but you can share it between threads. – Bi Rico Oct 25 '18 at 05:55
  • @BiRico, thanks, it worked with threading. – sanjeev kumar Oct 25 '18 at 13:52
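
For reference, a minimal sketch of the thread-based approach suggested in the comments, here using multiprocessing.pool.ThreadPool as one possible drop-in for the process pool. It assumes processFiles keeps the same (prcFile, spark) signature and that the config lookup and file pattern are the same as in the question; the SparkSession is created once on the driver and each task gets its own spark.newSession().

import glob
import configparser
from multiprocessing.pool import ThreadPool   # thread-based pool with the same map/starmap API
from pyspark.sql import SparkSession

config = configparser.ConfigParser()

def processFiles(prcFile, spark: SparkSession):
    # Runs in a thread on the driver, so touching the shared SparkContext is fine here
    print(prcFile)
    print(spark.sparkContext.getConf().get('spark.app.id'))
    print(spark.sparkContext.getConf().get('spark.app.name'))

def main(configPath, args):
    config.read(configPath)
    spark = SparkSession.builder.appName("multiprocessing").enableHiveSupport().getOrCreate()
    lst = glob.glob(config.get('DIT_setup_config', 'prcDetails') + 'prc_PrcId_[0-9].json')
    pool = ThreadPool(3)
    # Threads share the driver's SparkContext, so nothing has to be pickled;
    # each file is still processed against its own isolated session via newSession().
    pool.starmap(processFiles, [(f, spark.newSession()) for f in lst])
    pool.close()
    pool.join()

Because the Python side mostly waits on Spark jobs, the GIL is not a practical bottleneck here, and each newSession() shares the single SparkContext while keeping SQL configuration and temporary views isolated between tasks.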

0 Answers