
I am trying to write a Spark dataframe to Google Cloud Storage. This dataframe receives updates, so I need a partitioning strategy, and I need to write it to an exact file path in GCS.

I have created a Spark session as follows:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder\
        .config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")\
        .config("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")\
        .config("fs.gs.project.id", project_id)\
        .config("fs.gs.auth.service.account.enable", "true")\
        .config("fs.gs.auth.service.account.project.id", project_id)\
        .config("fs.gs.auth.service.account.private.key.id", private_key_id)\
        .config("fs.gs.auth.service.account.private.key", private_key)\
        .config("fs.gs.auth.service.account.client.email", client_email)\
        .config("fs.gs.auth.service.account.email", client_email)\
        .config("fs.gs.auth.service.account.client.id", client_id)\
        .config("fs.gs.auth.service.account.auth.uri", auth_uri)\
        .config("fs.gs.auth.service.account.token.uri", token_uri)\
        .config("fs.gs.auth.service.account.auth.provider.x509.cert.url", auth_provider_x509_cert_url)\
        .config("fs.gs.auth.service.account.client_x509_cert_url", client_x509_cert_url)\
        .config("spark.sql.avro.compression.codec", "deflate")\
        .config("spark.sql.avro.deflate.level", "5")\
        .getOrCreate())
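As an aside, if wiring every credential field through individual `fs.gs.auth.service.account.*` keys gets unwieldy, recent versions of the GCS connector can read a whole service-account JSON keyfile directly. A minimal sketch, assuming the connector jar is on the classpath; the keyfile path and the `apply_gcs_confs` helper are illustrative, not part of the original setup:

```python
# Alternative configuration sketch: point the GCS connector at a
# service-account JSON keyfile instead of individual credential fields.
GCS_CONFS = {
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    "google.cloud.auth.service.account.enable": "true",
    # Path to the downloaded service-account key; adjust to your environment.
    "google.cloud.auth.service.account.json.keyfile": "/path/to/keyfile.json",
}

def apply_gcs_confs(builder, confs=GCS_CONFS):
    """Apply each connector setting to a SparkSession builder-like object."""
    for key, value in confs.items():
        builder = builder.config(key, value)
    return builder
```

You would then build the session as `spark = apply_gcs_confs(SparkSession.builder).getOrCreate()`.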

and I am writing to GCS using:

df.write.format(file_format).save('gs://'+bucket_name+path+'/'+table_name+'/file_name.avro')

Now I see the file written to GCS at the path:

gs://bucket_name/table_name/file_name.avro/--auto assigned name--.avro

What I am expecting is for the file to be written as in Hadoop, with the final data file at:

gs://bucket_name/table_name/file_name.avro

Can anyone help me achieve this?
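Spark's distributed writers always produce a directory of auto-named part files, so a common workaround (a sketch under assumptions, not something the standard writer supports directly) is to coalesce to one partition, write to a temporary directory, and then rename the single part file via the Hadoop FileSystem API exposed through Spark's JVM gateway. `pick_part_file` and `promote_part_file` are hypothetical helper names:

```python
def pick_part_file(names):
    """From a Spark output directory listing, return the single part file."""
    parts = [n for n in names if n.startswith("part-")]
    if len(parts) != 1:
        raise ValueError("expected exactly one part file, got %d" % len(parts))
    return parts[0]

def promote_part_file(spark, tmp_dir, target_path):
    """Rename Spark's auto-named part file in tmp_dir to target_path."""
    jvm = spark._jvm
    conf = spark._jsc.hadoopConfiguration()
    Path = jvm.org.apache.hadoop.fs.Path
    fs = Path(tmp_dir).getFileSystem(conf)
    listing = [s.getPath().getName() for s in fs.listStatus(Path(tmp_dir))]
    part = pick_part_file(listing)
    fs.rename(Path(tmp_dir + "/" + part), Path(target_path))
    fs.delete(Path(tmp_dir), True)  # drop the _SUCCESS marker and tmp dir
```

Usage would be `df.coalesce(1).write.format(file_format).save(tmp_dir)` followed by `promote_part_file(spark, tmp_dir, final_path)`. Note that coalescing to one partition funnels all data through a single task, so this only makes sense for modestly sized outputs.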

karthik reddy

1 Answer


It looks like a limitation of the standard Spark library. Maybe this answer will help.

You may also want to check an alternative way of interacting with Google Cloud Storage from Spark, using the Cloud Storage Connector with Apache Spark.
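If you prefer to stay outside the Hadoop layer, the same post-write rename can be done with the `google-cloud-storage` client library (assumed installed; the function name and arguments below are illustrative):

```python
def object_basename(blob_name):
    """Return the last path segment of a GCS object name."""
    return blob_name.rsplit("/", 1)[-1]

def rename_spark_output(bucket_name, prefix, target_name):
    """Rename the part file Spark wrote under prefix to target_name,
    deleting the _SUCCESS marker and any other leftovers."""
    from google.cloud import storage  # assumed installed

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    for blob in list(client.list_blobs(bucket_name, prefix=prefix)):
        if object_basename(blob.name).startswith("part-"):
            bucket.rename_blob(blob, target_name)
        else:
            blob.delete()
```

Keep in mind that a GCS "rename" is a copy plus delete under the hood, so for very large files the Hadoop-side approach may be preferable.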

Pawel Czuczwara