I'm running a Jupyter-Spark setup and I want to benchmark my cluster with different input parameters. To make sure the environment stays the same between runs, I'm trying to reset (restart) the SparkContext after every iteration. Here is some pseudo code:
import os
import shutil
import time

import pyspark

temp_result_parquet = os.path.normpath('/home/spark_tmp_parquet')

i = 0
while i < max_i:
    i += 1
    if os.path.exists(temp_result_parquet):
        shutil.rmtree(temp_result_parquet)  # I know I could simply overwrite the parquet
    My_DF = do_something(i)                 # build the DataFrame for this set of input parameters
    My_DF.write.parquet(temp_result_parquet)

    # restart the SparkContext to wipe the slate clean before the next iteration
    sc.stop()
    time.sleep(10)
    sc = pyspark.SparkContext(master='spark://ip:here', appName='PySparkShell')
When I do this, the first iteration runs fine, but in the second I get the following error:
Py4JJavaError: An error occurred while calling o1876.parquet.
: org.apache.spark.SparkException: Job aborted.
[...]
Caused by: java.lang.IllegalStateException: SparkContext has been shutdown
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2014)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:188)
I tried running the code without the SparkContext restart, but that results in memory issues, so I'm restarting it to wipe the slate clean before every iteration. The weird result is that the parquet write thinks the SparkContext is down.
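For reference, the variant without the restart is essentially the same loop with the stop/restart lines removed (same do_something and max_i placeholders as above); it doesn't hit the shutdown error, but memory usage keeps growing across iterations:

# Same loop without restarting the SparkContext (sketch, placeholders as above).
# This avoids the shutdown error, but memory usage grows over the iterations.
i = 0
while i < max_i:
    i += 1
    if os.path.exists(temp_result_parquet):
        shutil.rmtree(temp_result_parquet)
    My_DF = do_something(i)
    My_DF.write.parquet(temp_result_parquet)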