I just massively shot myself in the foot by writing "pythonic" spark code like this:
```python
# spark = ... getOrCreate()  # essentially provided by the environment (Databricks)
with spark.newSession() as session:
    session.catalog.setCurrentDatabase("foo_test")
    do_something_within_database_scope(session)
assert spark.catalog.currentDatabase() == "default"
```
And oh, was I surprised when executing this notebook cell somehow terminated the cluster.
I read through this answer, which tells me that there can only be one spark context. That is fine. But why does exiting a session terminate the underlying context? Is there some requirement for this, or is it just a design flaw in pyspark?
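To make that concrete, here is a small sketch of what I mean (it assumes an already running `spark` session like the one Databricks provides; don't run the last two lines on a cluster you care about). `newSession()` hands back a new `SparkSession` on top of the very same `SparkContext`:

```python
# Sketch: all sessions created via newSession() share one SparkContext,
# so stopping any of them stops the context for all of them.
session = spark.newSession()
assert session.sparkContext is spark.sparkContext  # same underlying context

session.stop()          # SparkSession.stop() stops the *shared* SparkContext
spark.range(1).count()  # now fails, because the context behind `spark` is gone too
```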
I also understand that the session's `__exit__` call invokes `context.stop()` - I want to know why it is implemented like that!
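Written out as a standalone paraphrase, this is the behaviour I am asking about (this is not the actual pyspark source, just my mental model of what its `__exit__` does):

```python
class SparkSessionLike:
    """Paraphrase of the behaviour in question, not real pyspark code."""

    def __init__(self, shared_context):
        self._sc = shared_context   # the one SparkContext shared by all sessions

    def stop(self):
        self._sc.stop()             # tears down the shared context, not just this session

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.stop()                 # so leaving the `with` block kills everything
```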
I always think of a session as some user-initiated thing, like with databases or HTTP clients, which I can create and discard at will. If the session provides `__enter__` and `__exit__`, then I try to use it from within a `with` context to make sure I clean up after I am done.
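For comparison, this is the kind of behaviour I expect from a `with`-managed session; `requests` here is only an illustration of that mental model, not something my code actually uses:

```python
import requests

# Closing one requests.Session only releases that session's own resources;
# other sessions and anything shared elsewhere keep working.
with requests.Session() as s1:
    s1.get("https://example.com")

s2 = requests.Session()   # completely unaffected by s1 having been closed
s2.get("https://example.com")
s2.close()
```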
Is my understanding wrong, or alternatively why does pyspark deviate from that concept?
Edit: I tested this together with `databricks-connect`, which comes with its own pyspark Python module, but as pri pointed out below, it seems to be implemented the same way in standard pyspark.