I am trying to run a Spark job on a Kubernetes cluster and need to add extra JARs required by my application (specifically the Apache Iceberg and AWS SDK JARs). Initially, I ran the job with spark-submit in cluster deploy mode and had the --packages dependencies downloaded onto an NFS share so that both the driver and the executors could access them. Pointing the extraClassPath option at the JARs on the network share worked fine. The spark-submit configuration is below:
spark-submit \
--deploy-mode cluster --master k8s://https://10.164.64.27:6443 \
--conf spark.kubernetes.namespace=spark-cluster \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
--name $1 --conf spark.executor.instances=$2 \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.2,org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0,org.projectnessie:nessie-spark-extensions-3.3_2.12:0.44.0,software.amazon.awssdk:bundle:2.19.21,software.amazon.awssdk:url-connection-client:2.19.21 \
--conf "spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp/spark-data/ -Divy.home=/tmp/spark-data/" \
--conf "spark.executor.extraJavaOptions=-Divy.cache.dir=/tmp/spark-data/ -Divy.home=/tmp/spark-data/" \
--conf spark.executor.extraClassPath=local:///tmp/spark-data/jars/* \
--conf spark.kubernetes.driver.container.image=apache/spark-py:v3.3.2 \
--conf spark.kubernetes.executor.container.image=apache/spark-py:v3.3.2 ${@:3}
Here the network share is mounted at /tmp/spark-data in both the driver and executor pods.
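For completeness, the share is mounted into the pods with Spark's Kubernetes NFS volume options, roughly like the extra flags below added to the spark-submit command (the server address and export path shown here are placeholders, not my real values):

# Illustrative only: mount the NFS share at /tmp/spark-data in driver and executor pods
--conf spark.kubernetes.driver.volumes.nfs.spark-data.mount.path=/tmp/spark-data \
--conf spark.kubernetes.driver.volumes.nfs.spark-data.options.server=nfs.example.internal \
--conf spark.kubernetes.driver.volumes.nfs.spark-data.options.path=/export/spark-data \
--conf spark.kubernetes.executor.volumes.nfs.spark-data.mount.path=/tmp/spark-data \
--conf spark.kubernetes.executor.volumes.nfs.spark-data.options.server=nfs.example.internal \
--conf spark.kubernetes.executor.volumes.nfs.spark-data.options.path=/export/spark-data \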
Now I am trying to deploy the same application in client mode so that I can run it from a Jupyter notebook. This works, but it is much slower: a job that took 4 seconds in cluster mode now takes over 2 minutes. Digging a little deeper, we found that the slowdown happens because in client mode each executor copies every JAR from the driver instead of reading it directly from the network share as it did in cluster mode. I also tried building a Docker image with all the required JARs baked in so the driver wouldn't need to ship them, but that didn't help either; the driver still copied the JARs to the executors. Is there a way to tell Spark not to copy the JARs in client mode and instead use the JARs already present on my network share or in the Docker image?
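For reference, the client-mode session in the notebook is created with roughly the properties below (shown in spark-defaults.conf form; in the notebook they are passed via the session builder). The values simply mirror the cluster-mode command above, so treat this as a sketch rather than my exact configuration:

spark.master                                              k8s://https://10.164.64.27:6443
spark.submit.deployMode                                   client
spark.kubernetes.namespace                                spark-cluster
spark.kubernetes.authenticate.driver.serviceAccountName   spark-sa
spark.kubernetes.executor.container.image                 apache/spark-py:v3.3.2
spark.jars.packages                                       org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.2,org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0,org.projectnessie:nessie-spark-extensions-3.3_2.12:0.44.0,software.amazon.awssdk:bundle:2.19.21,software.amazon.awssdk:url-connection-client:2.19.21
spark.jars.ivy                                            /tmp/spark-data/
spark.executor.extraClassPath                             local:///tmp/spark-data/jars/*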