I am trying to figure out how to effectively use the new Spark Connect feature of Spark >= 3.4.0. Specifically, I want to set up a Kubernetes Spark cluster to which various applications (mainly PySpark) will connect and submit their workloads. It is my understanding (and please correct me if I'm wrong) that by running the command
./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0
a shared SparkContext is created, and it is not possible to submit further configurations (e.g. driver/executor cores and memory, packages, etc.) after it has been created.
The command creates a Spark driver instance inside the pod running the Spark Connect server (i.e. in client mode). I was also able to set Kubernetes as the master, and thus have Spark executors created dynamically upon task submission from my client applications.
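For reference, the startup command I have in mind looks roughly like this; the master URL, namespace, image, and resource values below are just placeholders for illustration:
./sbin/start-connect-server.sh \
  --master k8s://https://kubernetes.default.svc:443 \
  --packages org.apache.spark:spark-connect_2.12:3.4.0 \
  --conf spark.driver.memory=2g \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=2 \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.container.image=apache/spark:3.4.0
As far as I can tell, all driver/executor sizing has to be fixed at this point, and every client that connects afterwards shares those settings.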
What I want to know is whether it is possible to configure the Spark cluster in "cluster mode" instead, so that the driver is instantiated in a separate pod from the Spark Connect server.
Also, is it possible to run the Spark Connect server in high-availability mode?
Finally, are there any configurations that can be passed through the SparkSession builder, something like:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .remote("sc://spark-connect.spark.svc.cluster.local:15002")
    .config("spark.xxx.yyy", "some-value")
    .getOrCreate())
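Assuming settings passed this way are honoured at all, I would expect to be able to verify them from the client, continuing from the snippet above (the spark.xxx.yyy key is of course just a placeholder):

# check whether the value set on the builder is visible on the connected session
print(spark.conf.get("spark.xxx.yyy"))

But I don't know which settings, if any, can still take effect once the shared context already exists on the server.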
Thanks to anyone who can answer!