
I'm running a Jupyter + Spark setup and I want to benchmark my cluster with different input parameters. To make sure the environment stays the same, I'm trying to reset (restart) the SparkContext. Here is some pseudo code:

import os
import shutil
import time

import pyspark

temp_result_parquet = os.path.normpath('/home/spark_tmp_parquet')
i = 0 

while i < max_i:
    i += 1
    if os.path.exists(temp_result_parquet):
        shutil.rmtree(temp_result_parquet) # I know I could simply overwrite the parquet

    My_DF = do_something(i)
    My_DF.write.parquet(temp_result_parquet)

    sc.stop()
    time.sleep(10)
    sc = pyspark.SparkContext(master='spark://ip:here', appName='PySparkShell')

When I do this, the first iteration runs fine, but in the second I get the following error:

Py4JJavaError: An error occurred while calling o1876.parquet.
: org.apache.spark.SparkException: Job aborted.
[...]
Caused by: java.lang.IllegalStateException: SparkContext has been shutdown
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2014)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:188)

I tried running the code without restarting the SparkContext, but that leads to memory issues, so I'm restarting it to wipe the slate clean before every iteration. The strange result is that the Parquet write thinks the SparkContext is down.


1 Answer


Long story short, Spark (including PySpark) is not designed to handle multiple contexts in a single application. If you're interested in the JVM side of the story, I would recommend reading SPARK-2243 (resolved as Won't Fix).

There are a number of design decisions made in PySpark which reflect that, including, but not limited to, a singleton Py4J gateway. Effectively you cannot have multiple SparkContexts in a single application. SparkSession is not only bound to a SparkContext but also introduces problems of its own, like handling the local (standalone) Hive metastore if one is used. Moreover, there are functions which use SparkSession.builder.getOrCreate internally and depend on the behavior you see right now. A notable example is UDF registration. Other functions may exhibit unexpected behavior if multiple SQL contexts are present (for example RDD.toDF).
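To make the singleton behaviour concrete, here is a minimal sketch of mine (not part of the original answer), run against a local master purely for illustration: getOrCreate keeps handing back the same session, so everything in the application shares one underlying SparkContext, and objects created before that context is stopped keep pointing at the dead JVM context.

from pyspark.sql import SparkSession

# getOrCreate returns the already-instantiated session if its context is
# still alive, so both names below refer to the same object.
spark_a = SparkSession.builder.master('local[2]').appName('demo').getOrCreate()
spark_b = SparkSession.builder.getOrCreate()

print(spark_a is spark_b)                             # True
print(spark_a.sparkContext is spark_b.sparkContext)   # True

df = spark_a.range(10)
spark_a.sparkContext.stop()
# df still wraps the stopped context; df.write.parquet(...) would now fail
# with "SparkContext has been shutdown", just like in the question.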

Multiple contexts are not only unsupported but also, in my personal opinion, violate the single responsibility principle. Your business logic shouldn't be concerned with all the setup, cleanup and configuration details.
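If the goal is just a clean environment per benchmark run, one way to stay within the one-context-per-application rule (my own sketch, not part of the original answer) is to run each iteration in a separate process, so every run creates and tears down its own single SparkContext. Here benchmark_once.py is a hypothetical script containing the actual Spark work:

import subprocess
import sys

max_i = 5  # assumed number of benchmark runs

# Each iteration runs in a fresh Python process, which gets its own JVM,
# its own single SparkContext, and a clean slate when it exits.
for i in range(1, max_i + 1):
    subprocess.run([sys.executable, 'benchmark_once.py', str(i)], check=True)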
