
I just massively shot myself in the foot by writing "pythonic" Spark code like this:

# spark = ... getOrCreate()  # essentially provided by the environment (Databricks)
with spark.newSession() as session:
    session.catalog.setCurrentDatabase("foo_test")
    do_something_within_database_scope(session)

assert spark.catalog.currentDatabase() == "default"

And oh, was I surprised when executing this notebook cell somehow terminated the cluster.

I read through this answer, which tells me that there can only be one SparkContext. That is fine. But why does exiting a session terminate the underlying context? Is there some requirement for this, or is it just a design flaw in pyspark?

I also understand that the session's __exit__ call ends up invoking context.stop() - what I want to know is why it is implemented like that!

I always think of a session as a user-initiated thing, like with databases or HTTP clients, which I can create and discard at will. If a session provides __enter__ and __exit__, I use it within a with block to make sure I clean up after I am done.

Is my understanding wrong, or alternatively why does pyspark deviate from that concept?
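
In the meantime I am working around it with a small context manager of my own that only switches the catalog database and never calls stop() - a rough sketch only (database_scope is just my own hypothetical helper, and do_something_within_database_scope is the placeholder from above):

from contextlib import contextmanager

@contextmanager
def database_scope(spark, database):
    # Temporarily switch the catalog's current database.
    # Unlike `with spark.newSession(): ...`, this never calls stop(),
    # so the shared SparkContext (and the cluster) stays alive.
    previous = spark.catalog.currentDatabase()
    spark.catalog.setCurrentDatabase(database)
    try:
        yield spark
    finally:
        spark.catalog.setCurrentDatabase(previous)

with database_scope(spark, "foo_test") as session:
    do_something_within_database_scope(session)

assert spark.catalog.currentDatabase() == "default"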

Edit: I tested this with databricks-connect, which comes with its own pyspark Python module, but as pri pointed out below, it seems to be implemented the same way in standard pyspark.

Marti Nito
  • How and where are you exiting the session? Or are you talking about coming out of the with block? – pri Nov 15 '21 at 10:58
  • I mean coming out of the with block where `__exit__` of the session is called. `__exit__` in turn calls `self.context.stop()` which inevitably terminates/kills all other sessions using the same context. – Marti Nito Nov 15 '21 at 11:39

1 Answer


I looked at the code, and exiting the with block calls the method below:

@since(2.0)
def __exit__(
    self,
    exc_type: Optional[Type[BaseException]],
    exc_val: Optional[BaseException],
    exc_tb: Optional[TracebackType],
) -> None:
    """
    Enable 'with SparkSession.builder.(...).getOrCreate() as session: app' syntax.

    Specifically stop the SparkSession on exit of the with block.
    """
    self.stop()

And the stop method is:

@since(2.0)
def stop(self) -> None:
    """Stop the underlying :class:`SparkContext`."""
    from pyspark.sql.context import SQLContext

    self._sc.stop()
    # We should clean the default session up. See SPARK-23228.
    self._jvm.SparkSession.clearDefaultSession()
    self._jvm.SparkSession.clearActiveSession()
    SparkSession._instantiatedSession = None
    SparkSession._activeSession = None
    SQLContext._instantiatedContext = None

So I don't think you can stop just the SparkSession. Whenever a SparkSession gets stopped (regardless of how; in this case __exit__ is called when execution leaves the with block), it kills the underlying SparkContext along with it.
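
A quick way to see this outside of Databricks - a minimal sketch, assuming a plain local pyspark installation, and using pyspark's internal _jsc handle just for the final check:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
child = spark.newSession()

# Both sessions wrap the exact same SparkContext object.
assert child.sparkContext is spark.sparkContext

# Stopping the child session stops that shared context ...
child.stop()

# ... so the original session is unusable afterwards
# (checked here via the internal Java SparkContext handle).
assert spark.sparkContext._jsc.sc().isStopped()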

Link to the relevant Apache Spark code below:

https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L1029

pri
  • Hi pri, yes, I looked at the very same code and understand what is going on in the pyspark implementation. My question is more: "Why is it implemented the way it is?" I will try to clarify this above. – Marti Nito Nov 16 '21 at 08:41