
I am new to the world of Spark and Kubernetes. I built a Spark Docker image from the official Spark 3.0.1 distribution bundled with Hadoop 3.2, using the docker-image-tool.sh utility.

I have also created another Docker image for a Jupyter notebook and am trying to run Spark on Kubernetes in client mode. I first run my Jupyter notebook as a pod, do a port forward using kubectl, and access the notebook UI from my system at localhost:8888. All seems to be working fine, and I am able to run commands successfully from the notebook.

Now I am trying to access Azure Data Lake Gen2 from my notebook using the Hadoop ABFS connector. I am setting up the Spark context as below.

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
# Create Spark config for our Kubernetes-based cluster manager
sparkConf = SparkConf()
sparkConf.setMaster("k8s://https://kubernetes.default.svc.cluster.local:443")
sparkConf.setAppName("spark")
sparkConf.set("spark.kubernetes.container.image", "<<my_repo>>/spark-py:latest")
sparkConf.set("spark.kubernetes.namespace", "spark")
sparkConf.set("spark.executor.instances", "3")
sparkConf.set("spark.executor.cores", "2")
sparkConf.set("spark.driver.memory", "512m")
sparkConf.set("spark.executor.memory", "512m")
sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark")
sparkConf.set("spark.driver.port", "29413")
sparkConf.set("spark.driver.host", "my-notebook-deployment.spark.svc.cluster.local")

sparkConf.set("fs.azure.account.auth.type", "SharedKey")
sparkConf.set("fs.azure.account.key.<<storage_account_name>>.dfs.core.windows.net","<<account_key>>")

spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
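A quick sanity check like the following runs fine from the notebook, so the Kubernetes executors themselves come up correctly (an illustrative command, not the exact one I ran):

# Simple job that exercises the driver and the Kubernetes executors end to end
spark.range(1000).selectExpr("sum(id) AS total").show()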

And then I am running the below command to read a CSV file present at the ADLS location:

df = spark.read.csv("abfss://<<container>>@<<storage_account>>.dfs.core.windows.net/")

On running it I am getting the error Py4JJavaError: An error occurred while calling o443.csv. : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found

After some research, I found that I would have to explicitly include the hadoop-azure jar for the appropriate classes to be available. I downloaded the jar from here, put it in the /spark-3.0.1-bin-hadoop3.2/jars folder, and built the image again.

Unfortunately I am still getting this error. I manually verified that the jar file is indeed present in the Docker image and that it contains the class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem.

I looked at the entrypoint.sh present in the spark-3.0.1-bin-hadoop3.2/kubernetes/dockerfiles/spark folder, which is the entry point of our Spark Docker image. It adds all the jars present in the spark-3.0.1-bin-hadoop3.2/jars/ folder to the classpath.

# If HADOOP_HOME is set and SPARK_DIST_CLASSPATH is not set, set it here so Hadoop jars are available to the executor.
# It does not set SPARK_DIST_CLASSPATH if already set, to avoid overriding customizations of this value from elsewhere e.g. Docker/K8s.
if [ -n "${HADOOP_HOME}"  ] && [ -z "${SPARK_DIST_CLASSPATH}"  ]; then
  export SPARK_DIST_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"
fi

if ! [ -z ${HADOOP_CONF_DIR+x} ]; then
  SPARK_CLASSPATH="$HADOOP_CONF_DIR:$SPARK_CLASSPATH";
fi

According to my understanding, Spark should be able to find the class on its classpath without any additional setJars configuration.
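For what it's worth, one way to check from the notebook whether the driver JVM actually sees the class is to ask it directly through py4j; this is just a sanity-check sketch that assumes the SparkSession created above:

# Ask the driver JVM whether the ABFS filesystem class is loadable from its classpath.
jvm = spark.sparkContext._jvm
try:
    jvm.java.lang.Class.forName("org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem")
    print("hadoop-azure is visible to the driver")
except Exception as err:
    print("class not found on the driver:", err)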

Can someone please guide me on how to resolve this? I might be missing something quite basic here.

Ali Abbas
  • I found this useful and it resolved my issue locally after putting the hadoop-azure and azure-storage jars in the Spark install location (the C:\Spark\jar\ folder). – Palash Mondal Jan 30 '23 at 06:19

2 Answers


Applying the solution provided here...

How do we specify maven dependencies in pyspark

We can start a Spark session and include the required jar from Maven:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]")\
        .config('spark.jars.packages', 'org.apache.hadoop:hadoop-azure:3.3.1')\
        .getOrCreate()
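Note that the snippet above uses local[*]; for the Kubernetes client-mode setup in the question, the same dependency can be added to the existing SparkConf instead of baking the jar into the image. A sketch, assuming hadoop-azure 3.2.0 to match the Hadoop 3.2 build bundled with Spark 3.0.1 (use the version that matches your Hadoop):

# Resolve hadoop-azure (and its transitive dependencies) from Maven so both the
# driver and the executors get the ABFS classes on their classpaths.
sparkConf.set("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.2.0")
# ...keep the rest of the configuration from the question unchanged...
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()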
Dan Ciborowski - MSFT

Looks like I needed to add the hadoop-azure package to the Docker image that ran the Jupyter notebook and acted as the Spark driver. It's working as expected after doing that.

Ali Abbas
  • Hello Ali, I am facing the same problem right now. Did you only include the hadoop-azure package in the jars folder? And did you have to modify the name of the package or do any additional steps? Unfortunately it doesn't work for me. – Lorenz Jun 28 '21 at 09:30
  • Please @Ali Abbas, can you show us a code snippet of how you approached this? Having the same issue as well. – Sillians Feb 28 '22 at 08:51