I'm trying to run a Spark application using the Spark Operator.
For my example I need some Spark packages, but every time I deploy I have to re-download those packages, which sometimes takes a long time. I'm looking for a more efficient approach so that I don't have to download them every time I deploy a modification to the manifest.

Dockerfile

# Build stage
FROM bitnami/spark:3.3.2-debian-11-r20 AS builder
USER root

# Other python requirements
COPY requirements.txt /
RUN pip install --no-cache-dir -r /requirements.txt

# Copy your application code
COPY . /opt/bitnami/spark/

# Switch back to the non-root user
USER 1001

Spark Operator Manifest

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: pyspark-example
  namespace: example
spec:
  type: Python
  pythonVersion: "3"
  sparkVersion: 3.3.2
  mode: cluster
  image: "example:v1.0.13"
  imagePullPolicy: IfNotPresent
  mainApplicationFile: local:/opt/bitnami/spark/pyspark-app.py
  restartPolicy:
      type: Never
  driver:
    env:
      - name: AWS_REGION
        value: us-east-1
    cores: 1 # Number of CPU cores for the Spark driver
    coreLimit: 1200m
    memory: 1g # Memory for the Spark driver
    labels:
      version: "3.3.2"
  executor:
    env:
      - name: AWS_REGION
        value: us-east-1
    cores: 1 # Number of CPU cores for each Spark executor
    instances: 2 # Number of executor instances to run
    memory: 1g # Memory for each Spark executor
    labels:
      version: "3.3.2"
  deps:
    # These are the dependencies that take a long time to download
    packages:
      - "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0"
      - "software.amazon.awssdk:bundle:2.17.178"
      - "software.amazon.awssdk:url-connection-client:2.17.178"
      - "software.amazon.awssdk:s3:2.17.133"
      - "org.apache.hadoop:hadoop-aws:3.2.2"

pyspark-app.py

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Create a SparkConf object
conf = SparkConf()
conf.setAppName("Iceberg Test")
conf.set(
    "spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
)
conf.set(
    "spark.sql.catalog.glue_catalog",
    "org.apache.iceberg.spark.SparkCatalog",
)
conf.set("spark.master", "k8s://https://127.0.0.1:32773")

# Create a SparkSession based on the SparkConf
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Create a DataFrame and perform a simple operation
data = [("Alice", 25), ("Bob", 30), ("Carol", 28)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
df.show()

# Stop the Spark session
spark.stop()
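
If the intent is to use Iceberg with the AWS Glue catalog, the glue_catalog entry usually needs a few more properties than the catalog class alone. A minimal sketch of the extra settings, assuming Glue and S3 are the targets (the warehouse bucket below is a placeholder):

# Sketch (assumption): typical extra properties for an Iceberg Glue catalog.
conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://my-placeholder-bucket/warehouse/")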

RBAC

# Create the Role "spark-operator-permissions"
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: example
  name: spark-operator-permissions
rules:
- apiGroups: [""]
  resources: ["configmaps", "pods", "services"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
---
# Create the RoleBinding "spark-operator-binding"
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-operator-binding
  namespace: example
subjects:
- kind: ServiceAccount
  name: default
  namespace: example
roleRef:
  kind: Role
  name: spark-operator-permissions
  apiGroup: rbac.authorization.k8s.io
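
For completeness, assuming the manifests above are saved as rbac.yaml and pyspark-example.yaml (hypothetical filenames), I apply them with kubectl:

# Hypothetical filenames
kubectl apply -f rbac.yaml
kubectl apply -f pyspark-example.yaml
# Watch the driver pod created by the operator
kubectl get pods -n example -w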

I tried putting each jar inside the image like this:

ARG NAME_JAR=aws-java-sdk-bundle-1.11.704.jar

RUN curl ${REPO}com/amazonaws/aws-java-sdk-bundle/1.11.704/${NAME_JAR} --output /opt/bitnami/spark/jars/${NAME_JAR}

but this doesn't seem like the most optimal solution.
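
A more systematic version of the same idea (only a sketch, and still image-based) would be to download every artifact listed under deps.packages in a single build step, so the operator no longer resolves anything at deploy time. The URLs below are derived from the coordinates in the manifest, following the standard Maven Central layout, and are worth double-checking:

# Sketch: bake every dependency jar into the Spark jars directory at build time.
USER root
RUN cd /opt/bitnami/spark/jars && \
    curl -fLO https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.1.0/iceberg-spark-runtime-3.3_2.12-1.1.0.jar && \
    curl -fLO https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.17.178/bundle-2.17.178.jar && \
    curl -fLO https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.17.178/url-connection-client-2.17.178.jar && \
    curl -fLO https://repo1.maven.org/maven2/software/amazon/awssdk/s3/2.17.133/s3-2.17.133.jar && \
    curl -fLO https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.2/hadoop-aws-3.2.2.jar
USER 1001

But this still means rebuilding the image whenever a dependency changes, so I'd like to know whether there is a better way to cache or persist these packages across deployments.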
