I am trying to run a very simple Spark job that extracts some data from my Azure Data Lake and prints it to the screen, using the spark-on-k8s operator. For that I have built an image from a Dockerfile that looks like this:
FROM gcr.io/spark-operator/spark-py:v3.1.1
USER root:root
RUN mkdir -p /app
WORKDIR /app
COPY jars/ /opt/spark/jars
COPY simple-etl-job.py /app
USER 1001
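For context, simple-etl-job.py does nothing more than read a file over abfss:// and show it. A minimal sketch of it (the container, storage account, path and file format are placeholders, and the auth configuration is omitted here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("simple-etl-job").getOrCreate()

# Read a sample file straight from ADLS Gen2 over the abfss:// scheme
# (container, storage account and path are placeholders, not the real ones)
df = (
    spark.read.format("csv")
    .option("header", "true")
    .load("abfss://<container>@<storage-account>.dfs.core.windows.net/some/path/data.csv")
)

# Print the extracted data to the driver log
df.show()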
When I launch it as a job on Kubernetes, it fails with:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.load.
: java.io.IOException: No FileSystem for scheme: abfss
The strange thing is that I am copying into /opt/spark/jars the same jars I use for a local spark-submit job that does the same thing as my K8s code and runs successfully. Those jars are:
Those jars are:
- hadoop-azure-3.2.0.jar
- wildfly-openssl-1.0.4.Final.jar
- hadoop-azure-datalake-3.2.0.jar
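That local run is invoked roughly like this (the master, paths and options below are simplified, not the exact command):

spark-submit \
  --master local[*] \
  --jars jars/hadoop-azure-3.2.0.jar,jars/wildfly-openssl-1.0.4.Final.jar,jars/hadoop-azure-datalake-3.2.0.jar \
  simple-etl-job.py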
What else could I possibly be doing wrong?
P.S.: Here is my SparkApplication manifest:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: simple-spark-etl-job
  namespace: spark-operator
spec:
  type: Python
  mode: cluster
  image: "<my-org>/<my-image>:<my-tag>"
  imagePullPolicy: Always
  mainApplicationFile: "local:///app/simple-etl-job.py"
  sparkVersion: "3.1.1"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.1.1
    serviceAccount: default
  executor:
    cores: 1
    instances: 2
    memory: "512m"
    labels:
      version: 3.1.1