
I'm trying to read a PySpark DataFrame from Google Cloud Storage, but I keep getting an error that the service account has no storage.objects.create permission. The account does not have WRITER permissions, but the job is only reading parquet files:

spark_session.read.parquet(input_path)

18/12/25 13:12:00 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Repairing batch of 1 missing directories.
18/12/25 13:12:01 ERROR com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Failed to repair some missing directories.
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "***.gserviceaccount.com does not have storage.objects.create access to ***.",
    "reason" : "forbidden"
  } ],
  "message" : "***.gserviceaccount.com does not have storage.objects.create access to ***."
}
  • Where are you running this code from? Also, does the error go away when adding 'storage.objects.create' permission to the Service Account? – Maxim Dec 26 '18 at 11:08
  • @Maxim, I'm running this code in a PySpark app on a Dataproc cluster, scheduled with Airflow. I cannot test that, because this account cannot be granted that access level – Yoav Dec 27 '18 at 13:53

2 Answers


We found the issue: it is caused by the implicit directory auto-repair feature in the GCS connector. We disabled this behavior by setting fs.gs.implicit.dir.repair.enable to false, as shown in the sketch below.
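For reference, here is a minimal sketch of how this can be set when building the Spark session in PySpark. The app name and bucket path are placeholders, and the spark.hadoop. prefix is Spark's standard way of forwarding a property into the Hadoop configuration used by the GCS connector:

from pyspark.sql import SparkSession

# Sketch: disable the GCS connector's implicit directory repair so that a
# read-only job does not attempt storage.objects.create calls.
spark = (
    SparkSession.builder
    .appName("read-only-gcs")  # placeholder app name
    .config("spark.hadoop.fs.gs.implicit.dir.repair.enable", "false")
    .getOrCreate()
)

# Placeholder path; with repair disabled, read-only access is sufficient.
df = spark.read.parquet("gs://your-bucket/path/to/parquet/")

On Dataproc, the same property can typically also be passed at job-submission time (for example via --properties spark.hadoop.fs.gs.implicit.dir.repair.enable=false), or set on the SparkContext's Hadoop configuration as the comment below notes.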

  • For those who may not know, the above is set in the Hadoop configuration within the Spark context; if using PySpark: sc._jsc.hadoopConfiguration().set("fs.gs.implicit.dir.repair.enable", "false") – Moein Sep 01 '20 at 05:08

Please see this question: Why does Spark running in Google Dataproc store temporary files on external storage (GCS) instead of local disk or HDFS while using saveAsTextFile?

Spark creates temporary files when performing certain actions. I have run into this when extracting data from GCS files and converting it to a user-defined object. It can also do this when loading into BigQuery, because writing to Cloud Storage and then executing a single load job from GCS is more efficient. You can see the change which did that here.

Sadly, there is no concrete link I can give you, because as far as I know the problem is not documented. I will try to find one for you and will update my answer if I succeed.

  • Thanks, Bryan. So, are you suggesting that DataFrame.read() might be one of these actions? Are you familiar with any parameter that I can set to avoid that? – Yoav Dec 30 '18 at 09:01