I have pyspark code that writes to an s3 bucket like below:

df.write.mode('overwrite').parquet([S3_BUCKET_PATH])

I am testing writing to the bucket via the bucket's access point instead. The AWS documentation has an example writing to the access point using the CLI like below:

aws s3api put-object --bucket arn:aws:s3:us-west-2:123456789012:accesspoint/prod --key my-image.jpg --body my-image.jpg

I have tried doing it like this:

df.write.mode('overwrite').parquet("arn:aws:s3:us-west-2:123456789012:accesspoint/prod")

However, I get this error:

Relative path in absolute URI

Is it possible to write to an S3 access point using pyspark?

Melissa Guo
  • When I try it with s3a:// I get: null uri host. This can be caused by unencoded / in the password string – Melissa Guo Jul 01 '20 at 17:28
  • When I try it with s3:// I get: AmazonS3Exception: Bad Request – Melissa Guo Jul 01 '20 at 17:28
  • Tried using 'spark.sql.warehouse.dir' configuration as suggested here: https://stackoverflow.com/questions/38669206/spark-2-0-relative-path-in-absolute-uri-spark-warehouse. I get the error Relative path in absolute URI – Melissa Guo Jul 01 '20 at 21:05
  • Where is this running? In Amazon EMR? Or somewhere else? The meaning of "s3://", "s3n://", "s3a://" may be different depending on what is the exact underlying platform you're using. – Bruno Reis Jul 01 '20 at 22:30
  • Also, here's something you can try. Check this documentation page: https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html . It describes the format of the hostname that should be used in the requests to S3 when using Access Points. Maybe you could find a way to have your Spark installation use that format? – Bruno Reis Jul 01 '20 at 22:44
  • This runs on EMR – Melissa Guo Jul 01 '20 at 23:12
  • 1
    Did you find a way to use access points with pyspark? – Biplob Biswas Aug 12 '20 at 12:15
  • I need to read from access points, anyway to do that? – Rui Yang Mar 13 '21 at 08:35
  • 1
    As of 26 July 2021 this is now possible using S3 Access Point Aliases (which were meant exactly for this usecase). See more here: https://aws.amazon.com/about-aws/whats-new/2021/07/amazon-s3-access-points-aliases-allow-application-requires-s3-bucket-name-easily-use-access-point/ – blahblah Jul 27 '21 at 14:53
