
I am trying to submit a PySpark job that reads from ADLS Gen2 to Azure Kubernetes Service (AKS) and get the following exception:

Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2595)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3269)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.spark.deploy.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:191)
    at org.apache.spark.deploy.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:147)
    at org.apache.spark.deploy.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:145)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:145)
    at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$6(SparkSubmit.scala:365)
    at scala.Option.map(Option.scala:230)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:365)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2499)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2593)
    ... 27 more

My spark-submit looks like this:

$SPARK_HOME/bin/spark-submit \
--master k8s://https://XXX \
--deploy-mode cluster \
--name spark-pi \
--conf spark.kubernetes.file.upload.path=file:///tmp \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=XXX \
--conf spark.hadoop.fs.azure.account.auth.type.XXX.dfs.core.windows.net=SharedKey \
--conf spark.hadoop.fs.azure.account.key.XXX.dfs.core.windows.net=XXX \
--py-files abfss://data@XXX.dfs.core.windows.net/py-files/ml_pipeline-0.0.1-py3.8.egg \
abfss://data@XXX.dfs.core.windows.net/py-files/main_kubernetes.py

The job runs just fine on my VM and also loads data from ADLS Gen2 without problems. In this post, java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found, it is recommended to download the package and add it to the spark/jars folder. But I don't know where to download it, or why it has to be included in the first place if everything works fine locally.
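For what it's worth, the missing class ships in the hadoop-azure artifact, which can be downloaded from Maven Central; as the comments below note, its version should match the Hadoop build of your Spark distribution. A minimal sketch of pulling it into a Spark image, assuming the usual $SPARK_HOME/jars layout:

# Sketch only: fetch the ABFS driver from Maven Central into the image's jars folder.
# 3.2.0 assumes the default "Hadoop 3.2" build of Spark 3.1.1; check hadoop-azure's POM
# for transitive dependencies (e.g. azure-storage, wildfly-openssl) your setup may still miss.
curl -L -o $SPARK_HOME/jars/hadoop-azure-3.2.0.jar \
  https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/3.2.0/hadoop-azure-3.2.0.jar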

EDIT: I managed to include the jars in the Docker container. If I open a shell in that container and run the job there, it works fine and loads the files from ADLS. But if I submit the job to Kubernetes, it throws the same exception as before. Please, can someone help?
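One detail that may matter here: in the stack trace above, the error is raised from SparkSubmit.prepareSubmitEnvironment / DependencyUtils.resolveGlobPaths, which run in the spark-submit process itself, i.e. on the machine issuing the command, before any driver pod exists. Jars that only exist inside the container image are not visible at that point, so the Spark installation used to submit also needs the ABFS driver on its classpath to resolve the abfss:// paths in --py-files and the application file. A quick check on the submitting machine (assuming a standard Spark layout):

# Does the Spark used for spark-submit ship the ABFS driver?
ls "$SPARK_HOME/jars" | grep -i hadoop-azure || echo "hadoop-azure jar missing on the submitting machine"
# If missing, dropping hadoop-azure (and any needed dependencies) into this folder is one
# way to make the abfss:// paths resolvable at submit time.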

Spark 3.1.1, Python 3.8.5, Ubuntu 18.04

Lorenz
  • please try to use `spark-submit --packages org.apache.hadoop:hadoop-azure:3.2.0` to run. It will download package from maven. – Jim Xu Jun 03 '21 at 06:41
  • Hi Jim, thank you for the comment. Unfortunately that doesn't work either. I get the following exception: `Exception in thread "main" java.io.FileNotFoundException: /opt/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-373409f0-dc3b-40f1-a8a1-307e365b16a1-1.0.xml (No such file or directory)` – Lorenz Jun 03 '21 at 08:42
  • please refer to https://stackoverflow.com/questions/66722861/pyspark-packages-installation-on-kubernetes-with-spark-submit-ivy-cache-file-no – Jim Xu Jun 03 '21 at 08:47
  • I tried to manually include the jars but I run into dependency issues. How do I know which version of hadoop-azure to use and how can I download the jar including all dependencies? – Lorenz Jun 03 '21 at 09:53
  • The version should be same as your hadoop's version. – Jim Xu Jun 03 '21 at 12:56
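For reference, the Ivy error quoted in the comments above is usually worked around by pointing Ivy at a writable directory via spark.jars.ivy, as described in the linked question. A minimal sketch (the /tmp location is an assumption):

$SPARK_HOME/bin/spark-submit \
  --packages org.apache.hadoop:hadoop-azure:3.2.0 \
  --conf spark.jars.ivy=/tmp/.ivy \
  ...rest of the submit command as before...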

1 Answer


So I managed to fix my problem. It is definitely a workaround but it works.

I modified the PySpark Docker image by changing its entrypoint to:

ENTRYPOINT [ "/opt/entrypoint.sh" ]

Now I was able to run the container without it exiting immediately:

docker run -td <docker_image_id>

Then I could open a shell inside it:

docker exec -it <docker_container_id> /bin/bash

At this point I could submit the Spark job inside the container with the --packages flag:

$SPARK_HOME/bin/spark-submit \
  --master local[*] \
  --deploy-mode client \
  --name spark-python \
  --packages org.apache.hadoop:hadoop-azure:3.2.0 \
  --conf spark.hadoop.fs.azure.account.auth.type.user.dfs.core.windows.net=SharedKey \
  --conf spark.hadoop.fs.azure.account.key.user.dfs.core.windows.net=xxx \
  --files "abfss://data@user.dfs.core.windows.net/config.yml" \
  --py-files "abfss://data@user.dfs.core.windows.net/jobs.zip" \
  "abfss://data@user.dfs.core.windows.net/main.py"

Spark then downloaded the required dependencies, saved them under /root/.ivy2 in the container, and executed the job successfully.
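If you want to verify what got pulled, the jars resolved by --packages should end up flattened in the cache's jars directory (exact file names depend on the resolved versions):

# inside the container, after the successful run above
ls /root/.ivy2/jars/
# expect something like org.apache.hadoop_hadoop-azure-3.2.0.jar plus its transitive dependencies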

I copied the whole folder from the container onto the host machine:

sudo docker cp <docker_container_id>:/root/.ivy2/ /opt/spark/.ivy2/

And modified the Dockerfile again to copy the folder into the image:

COPY .ivy2 /root/.ivy2

Finally, I could submit the job to Kubernetes with this newly built image, and everything runs as expected.
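For completeness, a sketch of what the final submit to Kubernetes could look like with the rebuilt image; it assumes --packages is still passed so the dependencies are resolved from the pre-populated .ivy2 cache instead of being downloaded (master URL, image name and account are placeholders, as in the question):

$SPARK_HOME/bin/spark-submit \
  --master k8s://https://XXX \
  --deploy-mode cluster \
  --name spark-python \
  --packages org.apache.hadoop:hadoop-azure:3.2.0 \
  --conf spark.kubernetes.container.image=<rebuilt_image> \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.hadoop.fs.azure.account.auth.type.user.dfs.core.windows.net=SharedKey \
  --conf spark.hadoop.fs.azure.account.key.user.dfs.core.windows.net=xxx \
  --py-files "abfss://data@user.dfs.core.windows.net/jobs.zip" \
  "abfss://data@user.dfs.core.windows.net/main.py"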

Lorenz