I have a PySpark program with multiple independent modules, each of which can process data on its own to meet my various needs. They can also be chained together to process data in a pipeline. Each module builds a SparkSession and executes perfectly on its own.
However, when I try to run them serially within the same Python process, I run into issues. The moment the second module in the pipeline executes, Spark complains that the SparkContext I am attempting to use has been stopped:
py4j.protocol.Py4JJavaError: An error occurred while calling o149.parquet.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
Each of these modules builds a SparkSession at the beginning of its execution and stops that session (and with it the underlying SparkContext) at the end of its processing. I build and stop sessions/contexts like so:
session = SparkSession.builder.appName("myApp").getOrCreate()
session.stop()  # also stops the session's underlying SparkContext
According to the official documentation, getOrCreate "gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder." But I don't want this behavior of reusing an existing session. I can't find any way to disable it, and I can't figure out how to destroy the session -- I only know how to stop its associated SparkContext.
How can I build a new SparkSession in each independent module and execute those modules in sequence in the same Python process, without previous sessions interfering with the newly created ones?
The following is an example of the project structure:
main.py
import collect
import process
if __name__ == '__main__':
    data = collect.execute()
    process.execute(data)
collect.py
from pyspark.sql import SparkSession

import datagetter

def execute(data=None):
    session = SparkSession.builder.appName("myApp").getOrCreate()
    data = data if data else datagetter.get()
    rdd = session.sparkContext.parallelize(data)
    # ... do some work here ...
    result = rdd.collect()
    session.stop()
    return result
process.py
from pyspark.sql import SparkSession

import datagetter

def execute(data=None):
    session = SparkSession.builder.appName("myApp").getOrCreate()
    data = data if data else datagetter.get()
    rdd = session.sparkContext.parallelize(data)
    # ... do some work here ...
    result = rdd.collect()
    session.stop()
    return result