
I am trying to figure out how to effectively use the new Spark Connect feature introduced in Spark >= 3.4.0. Specifically, I want to set up a Kubernetes Spark cluster to which various applications (mainly PySpark) connect and submit their workloads. It is my understanding (and please correct me if I'm wrong) that by running the command

./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0

a shared Spark context is created, and it is not possible to pass further configuration (e.g. driver/executor cores and memory, packages, etc.) after it has been created.

The command creates a Spark driver instance inside the pod running the Spark Connect server (i.e. in client mode). I was also able to set Kubernetes as the master, so that Spark executors are created dynamically upon task submission from my client applications.
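For context, the full command I am running looks roughly like this (the API server URL, container image, namespace, and resource values are placeholders for my actual setup):

# placeholders: <kubernetes-api-server>, <my-spark-image>; resource values are examples only
./sbin/start-connect-server.sh \
  --master k8s://https://<kubernetes-api-server>:6443 \
  --packages org.apache.spark:spark-connect_2.12:3.4.0 \
  --conf spark.kubernetes.container.image=<my-spark-image> \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.executor.instances=2 \
  --conf spark.executor.memory=4g \
  --conf spark.driver.memory=2g

With this, the driver still runs inside the pod that hosts the Connect server, and only the executors are launched as separate Kubernetes pods.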

What I want to know is whether it is possible to configure the Spark cluster in "cluster mode" instead, so that the driver is instantiated in a pod separate from the Spark Connect server?

Also, is it possible to run the Spark Connect server in high-availability mode?

Finally, are there any configurations that can be passed through the SparkSession builder, something like this:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .remote("sc://spark-connect.spark.svc.cluster.local:15002")
    .config("spark.xxx.yyy", "some-value")
    .getOrCreate())

Thanks to anyone who can answer!
