Good day,
I am new to Google Cloud Storage and have recently been assigned a task to write data to a GCS bucket. I've done this before for S3, but I'm not sure how to do it with GCS. I have found some sample code here and there (like the one in this link or this one), but none of it is what I need. This is what has been provided to me:
bucket_name = {
    google_storage_hmac_access_id = "SOMEKEY"
    google_storage_hmac_secret = "SOMEKEY"
}
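In my script I simply put these into two plain variables (values redacted; the names gcs_key and gcs_secret are my own), which are what I use in the Hadoop configuration further down:

# Placeholder assignments; the real values are the HMAC access id and secret shown above
gcs_key = 'SOMEKEY'
gcs_secret = 'SOMEKEY'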
The approach in the first link requires a JSON file for credentials, which is not what I have in hand. So I used the approach in the second link and added the following to my code:
spark_context._jsc.hadoopConfiguration().set(
    'fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem'
)
# This is required if you are using a service account; set it to 'true' in that case
spark_context._jsc.hadoopConfiguration().set(
    'fs.gs.auth.service.account.enable', 'false'
)
# The following are required if you are using OAuth
spark_context._jsc.hadoopConfiguration().set(
    'fs.gs.auth.client.id', gcs_key
)
spark_context._jsc.hadoopConfiguration().set(
    'fs.gs.auth.client.secret', gcs_secret
)
where gcs_key and gcs_secret are those provided to me to connect to that bucket. And this is set as my path:
gs://bucket_name
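For reference, this is roughly how I then try to write (the DataFrame and the output sub-path are just placeholders):

# df is an existing Spark DataFrame; 'some_output_path' is a placeholder
df.write.mode('overwrite').parquet('gs://bucket_name/some_output_path')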
When I try this, it ends up opening a login page asking me to grant access to GCS with an email address, which is clearly not what I want here. I am looking for a working example of how to read/write data from/to a GCS bucket using those credentials.
Note 1: I have used the same access_id and secret to set up gsutil, and everything seems to be working fine.
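(For reference, the relevant part of the ~/.boto file that gsutil config generated looks roughly like this, with the keys redacted:)

[Credentials]
gs_access_key_id = SOMEKEY
gs_secret_access_key = SOMEKEY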
Note 2: I have included the required JAR file in the Spark jars directory (gcs-connector-hadoop3-latest.jar).
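For completeness, the session itself is created in the usual way (the app name is just a placeholder), and spark_context in the snippet above comes from it:

from pyspark.sql import SparkSession

# Nothing special here; the GCS connector JAR is picked up from the Spark jars directory
spark = SparkSession.builder.appName('gcs-write-test').getOrCreate()
spark_context = spark.sparkContext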