
I am trying to run a very simple Spark job that extracts some data from my Azure Data Lake and prints it on screen, using the spark-on-k8s operator. For that I have built an image from a Dockerfile that looks like this:

FROM gcr.io/spark-operator/spark-py:v3.1.1

USER root:root

RUN mkdir -p /app
WORKDIR /app

COPY jars/ /opt/spark/jars
COPY simple-etl-job.py /app
WORKDIR /app

USER 1001

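For reference, simple-etl-job.py does nothing more than read a file from the Data Lake and show it. Roughly something like this (the file format, storage account, container, and path are placeholders, not my real values):

# Minimal sketch of the job: read from ADLS Gen2 over abfss and print the result.
# Storage account, container, path, and file format below are placeholders;
# authentication settings (account key / OAuth) are set via spark.conf and omitted here.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("simple-etl-job").getOrCreate()

df = (
    spark.read
    .format("csv")
    .option("header", "true")
    .load("abfss://<container>@<storage-account>.dfs.core.windows.net/<path>/data.csv")
)
df.show()
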
And when I launch it as a job on Kubernetes, it fails with this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o56.load.
: java.io.IOException: No FileSystem for scheme: abfss

The strange thing is that I am copying into the /opt/spark/jars directory the same jars that are used for a local spark-submit job that does the same thing as my K8s code and runs successfully. Those jars are:

  • hadoop-azure-3.2.0.jar
  • wildfly-openssl-1.0.4.Final.jar
  • hadoop-azure-datalake-3.2.0.jar

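The local run that works with those same jars is essentially something like this (the master and paths are reconstructed for illustration, not copied verbatim):

# Local spark-submit that reads the same data successfully (paths are placeholders)
spark-submit \
  --master "local[*]" \
  --jars jars/hadoop-azure-3.2.0.jar,jars/hadoop-azure-datalake-3.2.0.jar,jars/wildfly-openssl-1.0.4.Final.jar \
  simple-etl-job.py
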
What else could I possibly be doing wrong?

P.S.: Here is my SparkApplication CRD:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: simple-spark-etl-job
  namespace: spark-operator
spec:
  type: Python
  mode: cluster
  image: "<my-org>/<my-image>:<my-tag>"
  imagePullPolicy: Always
  mainApplicationFile: "local:///app/simple-etl-job.py"
  sparkVersion: "3.1.1"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.1.1
    serviceAccount: default
  executor:
    cores: 1
    instances: 2
    memory: "512m"
    labels:
      version: 3.1.1
  • Hello @Murilo Mendonça, please, were you able to figure this out? I am using the same architecture as yours and I am stuck on this particular error message. What steps did you take to figure it out? Thanks – Sillians Mar 25 '22 at 15:43
  • Hi @Sillians, no I could not and since my team de-prioritized this demand, I did not continue working on a solution. If you figure it out, let me know! – Murilo Mendonça Apr 01 '22 at 13:49
  • Hello @Murilo Mendonça, thanks for the feedback. Yes, I did. I had to use a different spark-operator image, [spark-py:v3.1.1-hadoop3](https://gcr.io/spark-operator/spark-py@sha256:bf2fcd77f2b24bbd812c7a2d3635b6f1d3691d9da53996e1d615a9fbd572b314), and add the necessary jar files from the Maven repository to establish the connection to Azure Storage Gen2. – Sillians Apr 02 '22 at 20:08
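Putting that comment into Dockerfile form, the working setup would look roughly like this (the hadoop3 image tag is the one referenced in the comment; which jar versions to put in jars/ depends on the Hadoop build inside the image, so treat them as illustrative):

FROM gcr.io/spark-operator/spark-py:v3.1.1-hadoop3

USER root:root

# jars/ contains the hadoop-azure, hadoop-azure-datalake and wildfly-openssl jars
# downloaded from Maven Central, with versions matching the image's Hadoop 3 build
COPY jars/ /opt/spark/jars
COPY simple-etl-job.py /app/
WORKDIR /app

USER 1001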

1 Answer


The issue here may be occurring because the installed OpenSSL version is not compatible with the wildfly-openssl-*.jar on the new machine or environment, or because of how the hadoop-azure package was added to the Docker image.

Please check whether upgrading wildfly-openssl-*.Final.jar to the latest version helps. Also check for a JDK version mismatch.
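For example, a Dockerfile change along these lines (1.0.7.Final is only an example version; check Maven Central for the current wildfly-openssl release):

# Swap the old wildfly-openssl jar for a newer release (version shown is illustrative)
RUN rm -f /opt/spark/jars/wildfly-openssl-1.0.4.Final.jar
COPY jars/wildfly-openssl-1.0.7.Final.jar /opt/spark/jars/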

Also see whether the order of the jars makes any difference.
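One more thing worth checking, which is not covered above but is a common workaround for "No FileSystem for scheme" errors: map the scheme to its implementation class explicitly in the SparkApplication spec, so Spark does not depend on service discovery from the jars (the class names below are the ones shipped in hadoop-azure):

spec:
  hadoopConf:
    "fs.abfss.impl": "org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem"
    "fs.abfs.impl": "org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem"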

kavyaS
  • Both Spark and the JDK are on the same versions on my local machine and in the Docker image. It does not seem to be something with the wildfly jar, because the "no scheme abfss" error is thrown only when I don't have the `hadoop-azure-datalake-3.2.0.jar` in the proper directory. I will also give it a try and let you know. – Murilo Mendonça Feb 03 '22 at 13:56
  • As suspected, no change in the errors thrown after updating the wildfly jar either. – Murilo Mendonça Feb 03 '22 at 14:10