
I'm trying to submit my PySpark application to a Kubernetes cluster (Minikube) using spark-submit:

./bin/spark-submit \
   --master k8s://https://192.168.64.4:8443 \
   --deploy-mode cluster \
   --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 \
   --conf spark.kubernetes.container.image='pyspark:dev' \
   --conf spark.kubernetes.container.image.pullPolicy='Never' \
   local:///main.py

The application tries to reach a Kafka instance deployed inside the cluster, so I specified the jar dependency:

--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1

The container image I'm using is based on the one built with the utility script, with all the Python dependencies my app needs packed into it.
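Roughly, the image is built along these lines (the base image tag and file names below are illustrative, not my exact setup):

# Base image: the Python image produced by Spark's ./bin/docker-image-tool.sh (tag is an assumption)
FROM spark-py:3.0.1
# The base image switches to a non-root user, so switch back for the install step
USER root
# Hypothetical file names for the app and its Python dependencies
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install --no-cache-dir -r /tmp/requirements.txt
COPY main.py /main.py
# 185 is the default Spark user in images built by docker-image-tool.sh
USER 185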

The driver deploys correctly, fetches the Kafka package (I can provide the logs if needed), and launches the executor in a new pod.

But then the executor pod crashes:

ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.ClassNotFoundException: org.apache.spark.sql.kafka010.KafkaBatchInputPartition
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1986)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1850)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2160)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:407)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

So I investigated the executor pod and found that the jar is not present (as the stack trace suggests) in the $SPARK_CLASSPATH folder (which is set to ':/opt/spark/jars/*').
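For reference, a check along these lines (the pod name is a placeholder, and it only works while the executor pod is still around) shows no Kafka jar on that classpath:

# placeholder pod name; list Kafka-related jars on the executor's classpath
kubectl exec my-app-1234-exec-1 -- sh -c 'ls /opt/spark/jars | grep -i kafka'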

Do I also need to fetch the dependency and include it in the Spark jars folder when building the Docker image? (I thought the '--packages' option would make the executors retrieve the specified jar as well.)
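For reference, the workaround I ended up trying (see my comment below) was to bake the connector and its transitive dependencies into the image, roughly like this. The jar names and versions are my best guess for Spark 3.0.1 / Scala 2.12 and should be verified against the spark-sql-kafka-0-10 POM; the jars are downloaded beforehand (e.g. from Maven Central) into a local jars/ directory:

FROM pyspark:dev
# Assumed file names/versions; must match the Spark and Scala versions in the base image
COPY jars/spark-sql-kafka-0-10_2.12-3.0.1.jar /opt/spark/jars/
COPY jars/spark-token-provider-kafka-0-10_2.12-3.0.1.jar /opt/spark/jars/
COPY jars/kafka-clients-2.4.1.jar /opt/spark/jars/
COPY jars/commons-pool2-2.6.2.jar /opt/spark/jars/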

Alexandre Pieroux
  • Just tried adding all the jar dependencies from Maven and putting them in the Docker image, and it works. But I'm confused about why I have to retrieve them myself, as the --packages flag seems to indicate that it will retrieve them in the driver and executor pods – Alexandre Pieroux Feb 25 '21 at 14:14
  • Hi Alexandre, how did you include the dependencies in the Docker image? I did exactly the same, and if I run the job inside the built image it works fine, but not when I deploy it on Kubernetes. – Lorenz Jun 26 '21 at 08:15

1 Answer


Did you start out with the official Dockerfile (kubernetes/dockerfiles/spark/bindings/python/Dockerfile) as described in the Docker images section of the documentation? You also need to specify an upload location on a Hadoop-compatible filesystem and make sure that the specified Ivy home and cache directories have the correct permissions, as described in the Dependency Management section.

Example from the docs:

...
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.6
--conf spark.kubernetes.file.upload.path=s3a://<s3-bucket>/path
--conf spark.hadoop.fs.s3a.access.key=...
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
--conf spark.hadoop.fs.s3a.fast.upload=true
--conf spark.hadoop.fs.s3a.secret.key=....
--conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp"
...
Bernhard Stadler
  • Yes, I started with the "official Docker" image (which simply calls the given script to build one). I found out (after this post) that I in fact needed to create a shared volume where Ivy can put all the dependencies for it to work. Now that I understand that, I'm sticking with the everything-built-into-the-Docker-image approach I described in my comment, since it's fine in my case (deps are less than 2 MB). Thanks for the clarification, for anyone finding this post who missed this part of the documentation. – Alexandre Pieroux Apr 29 '21 at 12:16
  • Can you please check this? If I build the jars into the Docker image it works, but I need to pass jar files to extract data: https://stackoverflow.com/questions/69902575/spark-on-kubernetes-with-minio-postgres-minio-unable-to-create-executor-du – Rafa Nov 10 '21 at 16:22