I am new to the world of Spark and Kubernetes. I built a Spark Docker image from the official Spark 3.0.1 distribution bundled with Hadoop 3.2, using the docker-image-tool.sh utility.
I have also created another Docker image for a Jupyter notebook and am trying to run Spark on Kubernetes in client mode. I first run the Jupyter notebook as a pod, port-forward to it using kubectl, and access the notebook UI from my machine at localhost:8888. Everything seems to work fine, and I am able to run commands successfully from the notebook.
Now I am trying to access Azure Data Lake Storage Gen2 from my notebook using the Hadoop ABFS connector. I am setting up the Spark context as below:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
# Create Spark config for our Kubernetes based cluster manager
sparkConf = SparkConf()
sparkConf.setMaster("k8s://https://kubernetes.default.svc.cluster.local:443")
sparkConf.setAppName("spark")
sparkConf.set("spark.kubernetes.container.image", "<<my_repo>>/spark-py:latest")
sparkConf.set("spark.kubernetes.namespace", "spark")
sparkConf.set("spark.executor.instances", "3")
sparkConf.set("spark.executor.cores", "2")
sparkConf.set("spark.driver.memory", "512m")
sparkConf.set("spark.executor.memory", "512m")
sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark")
sparkConf.set("spark.driver.port", "29413")
sparkConf.set("spark.driver.host", "my-notebook-deployment.spark.svc.cluster.local")
sparkConf.set("fs.azure.account.auth.type", "SharedKey")
sparkConf.set("fs.azure.account.key.<<storage_account_name>>.dfs.core.windows.net","<<account_key>>")
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
Then I run the command below to read a CSV file present at the ADLS location:
df = spark.read.csv("abfss://<<container>>@<<storage_account>>.dfs.core.windows.net/")
On running it, I get the following error: Py4JJavaError: An error occurred while calling o443.csv. : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found
After some research, I found that I would have to explicitly include the hadoop-azure jar for the appropriate classes to be available. I downloaded the jar from here, put it in the /spark-3.0.1-bin-hadoop3.2/jars folder, and built the image again.
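(As a side note, my understanding is that the connector could instead be resolved at startup via spark.jars.packages rather than baking the jar into the image; the sketch below is just to show what I mean, and the hadoop-azure version in it is my assumption, picked to match the bundled Hadoop 3.2. I went with the jar-in-the-image approach instead.)
# Sketch only: have Spark pull the ABFS connector from Maven at startup.
# The org.apache.hadoop:hadoop-azure:3.2.0 coordinate is my assumption,
# chosen to match the Hadoop 3.2 bundled with Spark 3.0.1. This would need
# to be set before SparkSession.builder...getOrCreate() is called.
sparkConf.set("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.2.0")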
Unfortunately, I am still getting this error. I manually verified that the jar file is indeed present in the Docker image and contains the class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem.
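(The check I did was roughly equivalent to the snippet below, run inside the container; the jar path and file name are assumptions on my part, based on where docker-image-tool.sh places the Spark jars, typically /opt/spark/jars.)
import zipfile

# Rough equivalent of my manual check; the jar path/name is an assumption based
# on where the jars end up in the image built by docker-image-tool.sh.
jar_path = "/opt/spark/jars/hadoop-azure-3.2.0.jar"
with zipfile.ZipFile(jar_path) as jar:
    print(any(name.endswith("azurebfs/SecureAzureBlobFileSystem.class")
              for name in jar.namelist()))  # True if the class is packaged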
I looked at the entrypoint.sh present in the spark-3.0.1-bin-hadoop3.2/kubernetes/dockerfiles/spark folder, which is the entry point of our Spark Docker image. It adds all the jars present in the spark-3.0.1-bin-hadoop3.2/jars/ folder to the classpath:
# If HADOOP_HOME is set and SPARK_DIST_CLASSPATH is not set, set it here so Hadoop jars are available to the executor.
# It does not set SPARK_DIST_CLASSPATH if already set, to avoid overriding customizations of this value from elsewhere e.g. Docker/K8s.
if [ -n "${HADOOP_HOME}" ] && [ -z "${SPARK_DIST_CLASSPATH}" ]; then
  export SPARK_DIST_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"
fi

if ! [ -z ${HADOOP_CONF_DIR+x} ]; then
  SPARK_CLASSPATH="$HADOOP_CONF_DIR:$SPARK_CLASSPATH";
fi
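As a diagnostic, I believe I can check from the notebook which classpath the driver JVM was actually started with, and whether SPARK_DIST_CLASSPATH is set on the executors. The sketch below uses PySpark's internal _jvm py4j handle purely for inspection, so treat it as a debugging aid rather than anything official.
import os

# Diagnostic sketch: print the classpath the driver JVM was launched with
# (_jvm is PySpark's internal py4j gateway, used here only for inspection).
print(spark.sparkContext._jvm.java.lang.System.getProperty("java.class.path"))

# Check whether SPARK_DIST_CLASSPATH is actually set inside an executor.
print(spark.sparkContext.parallelize([0], 1)
      .map(lambda _: os.environ.get("SPARK_DIST_CLASSPATH", "<not set>"))
      .collect())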
According to my understanding, Spark should therefore be able to find the class on its classpath without any additional setJars/spark.jars configuration.
Can someone please guide me on how to resolve this? I might be missing something quite basic here.