I'm working on a proof of concept (POC) that runs Spark on Kubernetes, using AKS (Azure Kubernetes Service) for resource management. I use spark-submit to submit PySpark applications to Kubernetes in cluster mode, and applications run successfully.
I set up an Azure file share to store the application scripts, plus a Persistent Volume and a Persistent Volume Claim pointing to that share so that Spark can access the scripts from Kubernetes. This works for applications that don't write any output, like the pi.py example shipped with the Spark source code, but writing any kind of output fails in this setup. I tried running a word count script, and the line
wordCounts.saveAsTextFile(f"./output/counts")
causes the following exception (wordCounts is an RDD):
Traceback (most recent call last):
  File "/opt/spark/work-dir/wordcount2.py", line 14, in <module>
    wordCounts.saveAsTextFile(f"./output/counts")
  File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1570, in saveAsTextFile
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o65.saveAsTextFile.
: ExitCodeException exitCode=1: chmod: changing permissions of '/opt/spark/work-dir/output/counts': Operation not permitted
The directory "counts" is created by the Spark application without problems, so the application seems to have the required permissions, but the subsequent chmod that Spark performs internally fails. I haven't been able to figure out the cause or which configuration I'm missing in my commands. Any help would be greatly appreciated.
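To look at the two steps separately outside Spark, a minimal sketch like the one below could be run from a Python shell in the driver pod. It assumes the share is mounted at /opt/spark/work-dir; `probe_chmod` and the probe directory name are hypothetical and exist only for illustration, they are not part of Spark or Hadoop. It creates a directory, then attempts a permission change on it, and reports each step on its own.

```python
import os
import tempfile


def probe_chmod(base_dir, name="chmod-probe"):
    """Create base_dir/name, then try to chmod it.

    Returns a (created, chmod_ok) tuple. Hypothetical helper for
    illustration only; not part of Spark or Hadoop.
    """
    target = os.path.join(base_dir, name)
    try:
        os.makedirs(target, exist_ok=True)
    except OSError as exc:
        print(f"mkdir failed: {exc}")
        return False, False

    try:
        # Attempt a permission change, the kind of step that fails in the traceback.
        os.chmod(target, 0o755)
    except OSError as exc:
        print(f"chmod failed: {exc}")
        return True, False
    return True, True


if __name__ == "__main__":
    # Demo on a local temporary directory; in the pod, pass the mount path instead.
    with tempfile.TemporaryDirectory() as tmp:
        print(probe_chmod(tmp))
```

Calling `probe_chmod("/opt/spark/work-dir")` inside the pod should show whether the chmod step alone reproduces the "Operation not permitted" error, independent of Spark.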
The kubectl version I'm using is:
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:45:37Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"881d4a5a3c0f4036c714cfb601b377c4c72de543", GitTreeState:"clean", BuildDate:"2021-10-21T05:13:01Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
The Spark version is 2.4.5, and the command I'm using is:
<SPARK_PATH>/bin/spark-submit --master k8s://<HOST>:443 \
  --deploy-mode cluster \
  --name spark-pi3 \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=docker.io/datamechanics/spark:2.4.5-hadoop-3.1.0-java-8-scala-2.11-python-3.7-dm14 \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.azure-fileshare-pvc.options.claimName=azure-fileshare-pvc \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.azure-fileshare-pvc.mount.path=/opt/spark/work-dir \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.azure-fileshare-pvc.options.claimName=azure-fileshare-pvc \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.azure-fileshare-pvc.mount.path=/opt/spark/work-dir \
  --verbose /opt/spark/work-dir/wordcount2.py
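The four volume settings in the command follow one pattern, repeated for the driver and the executors. As a small sketch, a hypothetical helper `pvc_volume_confs` (not part of Spark) can generate them so the two roles stay in sync; the configuration key names are copied from the command above, and the helper name and argument names are assumptions for illustration.

```python
def pvc_volume_confs(volume_name, claim_name, mount_path):
    """Return --conf arguments that mount one PVC on both driver and executors."""
    args = []
    for role in ("driver", "executor"):
        prefix = f"spark.kubernetes.{role}.volumes.persistentVolumeClaim.{volume_name}"
        args += ["--conf", f"{prefix}.options.claimName={claim_name}"]
        args += ["--conf", f"{prefix}.mount.path={mount_path}"]
    return args


confs = pvc_volume_confs("azure-fileshare-pvc", "azure-fileshare-pvc", "/opt/spark/work-dir")
# Print one "--conf key=value" pair per line, joined with shell line continuations.
print(" \\\n".join(" ".join(confs[i:i + 2]) for i in range(0, len(confs), 2)))
```

The resulting list can be spliced into the argument list of a `subprocess.run` call that invokes spark-submit, or pasted into the shell command as printed.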
The PV and PVC are pretty basic. The PV yml is:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: azure-fileshare-pv
  labels:
    usage: azure-fileshare-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  azureFile:
    secretName: azure-storage-secret
    shareName: dssparktestfs
    readOnly: false
    secretNamespace: spark-operator
The PVC yml is:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: azure-fileshare-pvc
  # Set this annotation to NOT let Kubernetes automatically create
  # a persistent volume for this volume claim.
  annotations:
    volume.beta.kubernetes.io/storage-class: ""
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  selector:
    # To make sure we match the claim with the exact volume, match the label
    matchLabels:
      usage: azure-fileshare-pv
Let me know if more info is needed.