The problem is quite simple: you have a local Spark instance (either a cluster or just running in local mode) and you want to read from gs://
4 Answers
In my case on Spark 2.4.3 I needed to do the following to enable GCS access from Spark local. I used a JSON keyfile instead of the client.id/secret approach proposed in another answer.

- In $SPARK_HOME/jars/, use the shaded gcs-connector jar from here: http://repo2.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop2-1.9.17/ or else I had various failures with transitive dependencies.
- (Optional) To my build.sbt I added:
  "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-1.9.17" exclude("javax.jms", "jms") exclude("com.sun.jdmk", "jmxtools") exclude("com.sun.jmx", "jmxri")
- In $SPARK_HOME/conf/spark-defaults.conf, add:
  spark.hadoop.google.cloud.auth.service.account.enable true
  spark.hadoop.google.cloud.auth.service.account.json.keyfile /path/to/my/keyfile

And everything is working.
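For reference, a minimal PySpark sketch of reading through the connector once the jar and the spark-defaults.conf entries above are in place (the bucket path is a hypothetical placeholder):

from pyspark.sql import SparkSession

# Assumes the shaded gcs-connector jar is in $SPARK_HOME/jars/ and the
# spark.hadoop.google.cloud.auth.* settings above are in spark-defaults.conf.
spark = SparkSession.builder.appName("gcs-read-test").getOrCreate()

# gs://my-bucket/path/to/data.csv is a hypothetical path; replace it with your own.
df = spark.read.csv("gs://my-bucket/path/to/data.csv", header=True)
df.show(5)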

- Thank you for your answer. Is it possible to add an example of the JSON keyfile? – orestis Jun 01 '19 at 13:49
- The key file should be just the usual service account keys, described here: https://cloud.google.com/iam/docs/creating-managing-service-account-keys – Nate Jun 02 '19 at 17:56
I am submitting here the solution I have come up with by combining different resources:

- Download the Google Cloud Storage connector (gcs-connector) and store it in the $SPARK/jars/ folder (check Alternative 1 at the bottom).
- Download the core-site.xml file from here, or copy it from below. This is a configuration file used by Hadoop (which is used by Spark).
- Store the core-site.xml file in a folder. Personally I create the $SPARK/conf/hadoop/conf/ folder and store it there.
- In the spark-env.sh file, indicate the Hadoop conf folder by adding the following line:
  export HADOOP_CONF_DIR=</absolute/path/to/hadoop/conf/>
- Create an OAuth2 key from the respective page of the Google Console (Google Console -> API Manager -> Credentials).
- Copy the credentials into the core-site.xml file.

Alternative 1: Instead of copying the jar to the $SPARK/jars folder, you can store it in any folder and add that folder to the Spark classpath. One way is to edit SPARK_CLASSPATH in spark-env.sh, but SPARK_CLASSPATH is now deprecated. Therefore one can look here for how to add a jar to the Spark classpath. A quick PySpark check that core-site.xml is actually picked up is sketched after the XML below.
<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>Register GCS Hadoop filesystem</description>
  </property>
  <property>
    <name>fs.gs.auth.service.account.enable</name>
    <value>false</value>
    <description>Force OAuth2 flow</description>
  </property>
  <property>
    <name>fs.gs.auth.client.id</name>
    <value>32555940559.apps.googleusercontent.com</value>
    <description>Client id of Google-managed project associated with the Cloud SDK</description>
  </property>
  <property>
    <name>fs.gs.auth.client.secret</name>
    <value>fslkfjlsdfj098ejkjhsdf</value>
    <description>Client secret of Google-managed project associated with the Cloud SDK</description>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>_THIS_VALUE_DOES_NOT_MATTER_</value>
    <description>This value is required by the GCS connector, but not used in the tools provided here.
      The value provided is actually an invalid project id (starts with `_`).
    </description>
  </property>
</configuration>
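A minimal PySpark sketch to verify that the core-site.xml settings above are picked up (the bucket path is a hypothetical placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-config-check").getOrCreate()
hadoop_conf = spark._jsc.hadoopConfiguration()

# If core-site.xml was found via HADOOP_CONF_DIR, these return the values set above;
# if they come back as None, the file is not on the configuration path.
print(hadoop_conf.get("fs.gs.impl"))
print(hadoop_conf.get("fs.gs.auth.client.id"))

# Hypothetical path; replace with your own bucket.
spark.read.csv("gs://my-bucket/some/file.csv", header=True).show(5)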

- I followed all the steps but got the following error in pyspark: `Py4JJavaError: An error occurred while calling z:org.apache.hadoop.fs.FileSystem.get. : java.io.IOException: No FileSystem for scheme: gs`. I am worried that pyspark does not run spark-env.cmd (I am using Windows). It would help to check that core-site.xml gets picked up but I don't know how... – mchl_k Apr 23 '19 at 13:03
Considering that it has been a while since the last answer, I thought I would share my recent solution. Note, the following instructions are for Spark 2.4.4.

- Download the "gcs-connector" for the type of Spark/Hadoop you have from here. Search for the "Other Spark/Hadoop clusters" topic.
- Move the "gcs-connector" to $SPARK_HOME/jars. See more about $SPARK_HOME below.
- Make sure that all the environment variables are properly set up for your Spark application to run, that is:
  a. SPARK_HOME pointing to the location where you have saved the Spark installation.
  b. GOOGLE_APPLICATION_CREDENTIALS pointing to the location of the JSON key. If you have just downloaded it, it will be in your ~/Downloads folder.
  c. JAVA_HOME pointing to the location of your Java 8* "Home" folder.

If you are on Linux/Mac OS you can use export VAR=DIR, where VAR is the variable and DIR the location; if you want to set them permanently, add them to your ~/.bash_profile or ~/.zshrc file. For Windows OS users, in cmd write set VAR=DIR for shell-scoped settings, or setx VAR DIR to store the variables permanently (a short sketch of this setup follows below).

That has worked for me and I hope it helps others too.

* Spark works on Java 8, therefore some of its features might not be compatible with the latest Java Development Kit.
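A minimal PySpark sketch of this setup, assuming the gcs-connector jar is already in $SPARK_HOME/jars and using hypothetical placeholder paths (normally you would set the variables via export/setx as described above rather than in code):

import os
from pyspark.sql import SparkSession

# Hypothetical key location; replace with your own.
os.environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS",
                      os.path.expanduser("~/Downloads/my-service-account-key.json"))

spark = SparkSession.builder.appName("gcs-env-test").getOrCreate()

# Hypothetical bucket path; replace with your own.
spark.read.json("gs://my-bucket/some/data.json").show(5)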

Try the following configuration using PySpark. JARS_PATH is a string variable containing the absolute paths to the jar files. Do set up the required environment variables.
from pyspark.sql import SparkSession

# Absolute paths to the connector jars; adjust to your own locations.
JARS_PATH = '/LOCATION-TO-JARS/gcs-connector-hadoop3-latest.jar,/LOCATION-TO-JARS/spark-bigquery-latest_2.12.jar'
SPARK_APP_NAME = 'MY-APP-NAME'  # placeholder application name

spark = SparkSession.builder.appName(SPARK_APP_NAME).config('spark.jars', JARS_PATH).getOrCreate()

spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('fs.gs.project.id', 'MY-GCP-PROJECT-ID')
spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
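After that configuration, a read is the usual Spark call; a short sketch with a hypothetical bucket path (it also assumes credentials such as GOOGLE_APPLICATION_CREDENTIALS are set up, as noted above):

# Hypothetical bucket and prefix; replace with your own.
df = spark.read.parquet('gs://my-bucket/path/to/parquet-data/')
df.printSchema()
df.show(10)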
