
I have already looked at several similar questions (here and here) and some other blog posts and Stack Overflow questions.

I have the PySpark script below, which is meant to read data from a GCS bucket:

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName("GCSFilesRead")\
    .getOrCreate()

bucket_name="my-gcs-bucket"
path=f"gs://{bucket_name}/path/to/file.csv"

df=spark.read.csv(path, header=True)
print(df.head())

which fails with the error:

py4j.protocol.Py4JJavaError: An error occurred while calling o29.csv.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)

My environment-setup Dockerfile looks something like this:

FROM openjdk:11.0.11-jre-slim-buster

# install a whole bunch of apt-get dev essential libraries (unixodbc-dev, libgdbm-dev...)
# some other setup for other services
# copy my repository, requirements file
# install Python-3.9 and activate a venv

RUN pip install pyspark==3.3.1

No environment variables such as HADOOP_HOME, SPARK_HOME or PYSPARK_PYTHON are set; it is just a plain installation of PySpark.

I have tried running:

spark = SparkSession.builder\
    .appName("GCSFilesRead")\
    .config("spark.jars.package", "/path/to/jar/gcs-connector-hadoop3-2.2.10.jar") \
    .getOrCreate()

or

spark = SparkSession.builder\
    .appName("GCSFilesRead")\
    .config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")\
    .config("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")\
    .getOrCreate()

and some other solutions, but I am still getting the same error.

My questions are:

  1. In such a setup, what do I need to do to get this script running? I have seen answers about updating pom files, the core-site.xml file, etc., but it looks like a plain PySpark installation does not come with those files.

  2. How can I make the jar installation/setup part of the default Spark settings in a PySpark-only installation? I would like to simply run this script with python path/to/file.py, without passing any arguments to spark-submit, setting it in the SparkSession config, etc. I know that with a regular Spark installation we can add default jars to the spark-defaults.conf file, but it looks like a plain PySpark installation does not come with that file either.

Thank you in advance!

kpython

1 Answer


The error message No FileSystem for scheme: gs indicates that Spark does not recognize the path to your bucket (gs://) because it could not find the GCS connector. I suggest reviewing the Cloud Storage connector documentation to make sure that your settings were applied correctly.
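Once the connector jar is available on disk, an in-session setup can look something like the sketch below. This is only a sketch: the jar path is the one from your question, the key-file path is a placeholder, and the fs.gs.* properties are passed with the spark.hadoop. prefix so they reach the Hadoop configuration.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("GCSFilesRead")
    # a local jar path goes in "spark.jars" (Maven coordinates would go in "spark.jars.packages")
    .config("spark.jars", "/path/to/jar/gcs-connector-hadoop3-2.2.10.jar")
    # Hadoop filesystem settings need the "spark.hadoop." prefix when set on the builder
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    # service-account authentication; the key-file path below is a placeholder
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/keyfile.json")
    .getOrCreate()
)

df = spark.read.csv("gs://my-gcs-bucket/path/to/file.csv", header=True)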
Alternatively, you can mount the bucket with gcsfuse and try the following:

  • Authenticate your user:

from google.colab import auth
auth.authenticate_user()

  • Then install gcsfuse with the following snippet:

!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

  • Then mount the bucket as follows:

!mkdir mybucket
!gcsfuse mybucket mybucket

You can then store your data to the mounted path:

df.write.csv('/content/mybucket/df')
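With the bucket mounted this way, the read from your script can also go through the local mount point instead of the gs:// scheme, for example (assuming the Colab-style /content working directory used above):

df = spark.read.csv("/content/mybucket/path/to/file.csv", header=True)
print(df.head())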

I would also recommend having a look at this thread for an example of a detailed workflow.
You can also try the following: to access Google Cloud Storage, include the Cloud Storage connector jar when launching Spark:
spark-submit --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar your-pyspark-script.py
or
pyspark --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar
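For your second question (making the jar a default so that a plain python path/to/file.py works), one option with a pip-installed PySpark is to copy the connector into the jars/ directory that ships inside the pyspark package, since every jar in that directory is on the classpath by default. A sketch, using the jar path from your question:

import shutil
from pathlib import Path

import pyspark

# for a pip install, the pyspark package directory acts as SPARK_HOME
spark_home = Path(pyspark.__file__).parent
shutil.copy("/path/to/jar/gcs-connector-hadoop3-2.2.10.jar", spark_home / "jars")

You could run this once in your Dockerfile (or copy the jar into that directory directly). Depending on the connector version, you may still need the fs.gs.* and authentication properties, for example in a conf/spark-defaults.conf created under that same directory.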

Vaidehi Jamankar