
I just massively shot myself in the foot by writing "pythonic" Spark code like this:

# spark = ... getOrCreate()  # essentially provided by the environment (Databricks)
with spark.newSession() as session:
    session.catalog.setCurrentDatabase("foo_test")
    do_something_within_database_scope(session)

assert spark.catalog.currentDatabase() == "default"

And oh, was I surprised when executing this notebook cell somehow terminated the cluster.

I read through this answer, which tells me that there can only be one SparkContext. That is fine. But why does exiting a session terminate the underlying context? Is there some requirement for this, or is it just a design flaw in pyspark?

I also understand that the session's __exit__ call ends up invoking context.stop() - what I want to know is why it is implemented like that!

I always think of a session as a user-initiated thing, like with databases or HTTP clients, which I can create and discard at will. If a session provides __enter__ and __exit__, I use it within a with block to make sure I clean up after I am done.

Is my understanding wrong, or alternatively why does pyspark deviate from that concept?
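
In the meantime I am working around it with a small context manager of my own that only switches the catalog database and never calls stop() - a rough sketch only (database_scope is just my own hypothetical helper, and do_something_within_database_scope is the placeholder from above):

from contextlib import contextmanager

@contextmanager
def database_scope(spark, database):
    # Temporarily switch the catalog's current database.
    # Unlike `with spark.newSession(): ...`, this never calls stop(),
    # so the shared SparkContext (and the cluster) stays alive.
    previous = spark.catalog.currentDatabase()
    spark.catalog.setCurrentDatabase(database)
    try:
        yield spark
    finally:
        spark.catalog.setCurrentDatabase(previous)

with database_scope(spark, "foo_test") as session:
    do_something_within_database_scope(session)

assert spark.catalog.currentDatabase() == "default"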

Edit: I tested this with databricks-connect, which comes with its own pyspark Python module, but as pri pointed out below, it seems to be implemented the same way in standard pyspark.

Marti Nito
  • How and where are you exiting the session? Or are you talking about coming out of the with block? – pri Nov 15 '21 at 10:58
  • I mean coming out of the with block where `__exit__` of the session is called. `__exit__` in turn calls `self.context.stop()` which inevitably terminates/kills all other sessions using the same context. – Marti Nito Nov 15 '21 at 11:39

1 Answer


I looked at the code, and exiting the with block calls the method below:

@since(2.0)
def __exit__(
    self,
    exc_type: Optional[Type[BaseException]],
    exc_val: Optional[BaseException],
    exc_tb: Optional[TracebackType],
) -> None:
    """
    Enable 'with SparkSession.builder.(...).getOrCreate() as session: app' syntax.

    Specifically stop the SparkSession on exit of the with block.
    """
    self.stop()

And the stop method is:

@since(2.0)
def stop(self) -> None:
    """Stop the underlying :class:`SparkContext`."""
    from pyspark.sql.context import SQLContext

    self._sc.stop()
    # We should clean the default session up. See SPARK-23228.
    self._jvm.SparkSession.clearDefaultSession()
    self._jvm.SparkSession.clearActiveSession()
    SparkSession._instantiatedSession = None
    SparkSession._activeSession = None
    SQLContext._instantiatedContext = None

So I don't think you can stop just the SparkSession. Whenever a SparkSession gets stopped (regardless of how; in this case __exit__ is called when execution leaves the with block), it kills the underlying SparkContext along with it.
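
A quick way to see this outside of Databricks - a minimal sketch, assuming a plain local pyspark installation, and using pyspark's internal _jsc handle just for the final check:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
child = spark.newSession()

# Both sessions wrap the exact same SparkContext object.
assert child.sparkContext is spark.sparkContext

# Stopping the child session stops that shared context ...
child.stop()

# ... so the original session is unusable afterwards
# (checked here via the internal Java SparkContext handle).
assert spark.sparkContext._jsc.sc().isStopped()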

Link to the relevant Apache Spark code below:

https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L1029

pri
  • Hi pri, yes, I looked at the very same code and understand what is going on in the pyspark implementation. My question is more: "Why is it implemented the way it is?" I will try to clarify this above. – Marti Nito Nov 16 '21 at 08:41