I have already looked at several similar questions (here and here), as well as some other blog posts and Stack Overflow questions.
I have the PySpark script below, which is meant to read data from a GCS bucket:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("GCSFilesRead") \
    .getOrCreate()

bucket_name = "my-gcs-bucket"
path = f"gs://{bucket_name}/path/to/file.csv"

df = spark.read.csv(path, header=True)
print(df.head())
which fails with the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o29.csv.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
My environment is set up with a Dockerfile that looks something like this:
FROM openjdk:11.0.11-jre-slim-buster
# install a whole bunch of apt-get dev essential libraries (unixodbc-dev, libgdbm-dev...)
# some other setup for other services
# copy my repository, requirements file
# install Python-3.9 and activate a venv
RUN pip install pyspark==3.3.1
There are no environment variables such as HADOOP_HOME, SPARK_HOME, or PYSPARK_PYTHON set; it is just a plain pip installation of PySpark.
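For context, this is roughly how I have been inspecting the layout of that pip-installed PySpark inside the container (the exact paths are from my environment and may differ):

# Locate the pip-installed PySpark package and its bundled jars directory.
import os
import pyspark

spark_home = os.path.dirname(pyspark.__file__)
print(spark_home)  # e.g. /venv/lib/python3.9/site-packages/pyspark
print(os.listdir(os.path.join(spark_home, "jars"))[:5])  # the Spark/Hadoop jars bundled with the pip install
print(os.path.isdir(os.path.join(spark_home, "conf")))  # check whether a conf/ directory (spark-defaults.conf) exists here

The gcs-connector jar does not appear anywhere under that jars directory, which I assume is why the "gs" scheme is unknown.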
I have tried running:
spark = SparkSession.builder \
    .appName("GCSFilesRead") \
    .config("spark.jars.package", "/path/to/jar/gcs-connector-hadoop3-2.2.10.jar") \
    .getOrCreate()
or
spark = SparkSession.builder \
    .appName("GCSFilesRead") \
    .config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS") \
    .getOrCreate()
as well as some other suggested solutions, but I am still getting the same error.
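For completeness, based on my reading of the Spark configuration docs, I would expect a per-script setup along these lines (spark.jars for a local jar path, or spark.jars.packages for Maven coordinates, and the spark.hadoop. prefix so the fs.gs.* properties reach the Hadoop configuration). Please correct me if that is wrong; my real question below is about not having to repeat this in every script:

from pyspark.sql import SparkSession

# Sketch of a per-script configuration; the jar path and the Maven
# coordinates below are just the ones I downloaded and may need adjusting.
spark = SparkSession.builder \
    .appName("GCSFilesRead") \
    .config("spark.jars", "/path/to/jar/gcs-connector-hadoop3-2.2.10.jar") \
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS") \
    .getOrCreate()
# alternatively: .config("spark.jars.packages", "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.10")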
My questions are:
1. In such a setup, what do I need to do to get this script running? I have seen answers about updating pom.xml, core-site.xml, etc., but it looks like a plain PySpark installation does not come with those files.
2. How can I make the jar installation/setup a default Spark setting in a PySpark-only installation? I would like to simply run the script with
python path/to/file.py
without passing any arguments to spark-submit, setting anything in the SparkSession config, etc. I know that with a regular Spark installation we can add default jars to the spark-defaults.conf file, but it looks like a plain PySpark installation does not come with that file either (a sketch of what I mean is below).
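For illustration, with a full Spark installation I believe the equivalent would be a $SPARK_HOME/conf/spark-defaults.conf along these lines (the jar path is just an example from my machine), and I am essentially looking for the pip-install equivalent of this:

spark.jars                                   /path/to/jar/gcs-connector-hadoop3-2.2.10.jar
spark.hadoop.fs.gs.impl                      com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.AbstractFileSystem.gs.impl   com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS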
Thank you in advance!