I'm trying to run a Spark application using the Spark Operator.
For my example I need a few Spark packages, but every time I deploy, those packages are downloaded again, which sometimes takes a long time. I'm looking for a more efficient approach so the packages don't have to be re-downloaded every time I deploy a modification to the manifest.
Dockerfile
# Build stage
FROM bitnami/spark:3.3.2-debian-11-r20 AS builder
USER root
# Other python requirements
COPY requirements.txt /
RUN pip install --no-cache-dir -r /requirements.txt
# Copy your application code
COPY . /opt/bitnami/spark/
# Switch back to the non-root Bitnami user
USER 1001
Spark Operator Manifest
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: pyspark-example
namespace: example
spec:
type: Python
pythonVersion: "3"
sparkVersion: 3.3.2
mode: cluster
image: "example:v1.0.13"
imagePullPolicy: IfNotPresent
mainApplicationFile: local:/opt/bitnami/spark/pyspark-app.py
restartPolicy:
type: Never
driver:
env:
- name: AWS_REGION
value: us-east-1
cores: 1 # Number of CPU cores for the Spark driver
coreLimit: 1200m
memory: 1g # Memory for the Spark driver
labels:
version: 3.1.1
executor:
env:
- name: AWS_REGION
value: us-east-1
cores: 1 # Number of CPU cores for each Spark executor
instances: 2 # Number of executor instances to run
memory: 1g # Memory for each Spark executor
labels:
version: 3.1.1
deps:
# These are the depency that takes long time to download
packages:
- "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0"
- "software.amazon.awssdk:bundle:2.17.178"
- "software.amazon.awssdk:url-connection-client:2.17.178"
- "software.amazon.awssdk:s3:2.17.133"
- "org.apache.hadoop:hadoop-aws:3.2.2"
pyspark-app.py
from pyspark import SparkConf
from pyspark.sql import SparkSession
# Create a SparkConf object
conf = SparkConf()
conf.setAppName("Iceberg Test")
# conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
conf.set(
"spark.jars.packages",
"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
)
conf.set(
"spark.sql.catalog.glue_catalog",
"org.apache.iceberg.spark.SparkCatalog",
)
conf.set("spark.master", "k8s://https://127.0.0.1:32773")
# Create a SparkSession based on the SparkConf
spark = SparkSession.builder.config(conf=conf).getOrCreate()
# Create a DataFrame and perform a simple operation
data = [("Alice", 25), ("Bob", 30), ("Carol", 28)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
df.show()
# Stop the Spark session
spark.stop()
RBAC manifest
# Create the Role "spark-operator-permissions"
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: example
  name: spark-operator-permissions
rules:
  - apiGroups: [""]
    resources: ["configmaps", "pods", "services"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
# Create the RoleBinding "spark-operator-binding"
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-operator-binding
  namespace: example
subjects:
  - kind: ServiceAccount
    name: default
    namespace: example
roleRef:
  kind: Role
  name: spark-operator-permissions
  apiGroup: rbac.authorization.k8s.io
I tried putting every JAR inside the image, for example:
ARG NAME_JAR=aws-java-sdk-bundle-1.11.704.jar
RUN curl ${REPO}com/amazonaws/aws-java-sdk-bundle/1.11.704/${NAME_JAR} --output /opt/bitnami/spark/jars/${NAME_JAR}
but this doesn't seem like the most optimal solution.
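To make it concrete, here is roughly what that workaround looks like once it covers everything listed under deps.packages. This is only a sketch: it assumes Maven Central (https://repo1.maven.org/maven2/) as the repository and the standard <group-path>/<artifact>/<version>/<artifact>-<version>.jar layout, and it still leaves out the transitive dependencies that --packages would normally resolve for me.
# Sketch only: fetch each jar from deps.packages into the image at build time,
# so nothing is downloaded when the SparkApplication is submitted.
ARG REPO=https://repo1.maven.org/maven2/
RUN curl -fL ${REPO}org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.1.0/iceberg-spark-runtime-3.3_2.12-1.1.0.jar \
        --output /opt/bitnami/spark/jars/iceberg-spark-runtime-3.3_2.12-1.1.0.jar && \
    curl -fL ${REPO}software/amazon/awssdk/bundle/2.17.178/bundle-2.17.178.jar \
        --output /opt/bitnami/spark/jars/bundle-2.17.178.jar && \
    curl -fL ${REPO}software/amazon/awssdk/url-connection-client/2.17.178/url-connection-client-2.17.178.jar \
        --output /opt/bitnami/spark/jars/url-connection-client-2.17.178.jar && \
    curl -fL ${REPO}software/amazon/awssdk/s3/2.17.133/s3-2.17.133.jar \
        --output /opt/bitnami/spark/jars/s3-2.17.133.jar && \
    curl -fL ${REPO}org/apache/hadoop/hadoop-aws/3.2.2/hadoop-aws-3.2.2.jar \
        --output /opt/bitnami/spark/jars/hadoop-aws-3.2.2.jar
Maintaining this list by hand (and chasing its transitive dependencies) on every version bump is exactly what I'd like to avoid, which is why I'm asking whether there is a better way to cache or pre-install the packages.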