
I am saving a Spark DataFrame to an S3 bucket. The default storage class for the saved files is STANDARD, but I need it to be STANDARD_IA. What is the option to achieve this? I have looked through the Spark source code and found no such option for DataFrameWriter in https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala

Below is the code I am using to write to S3:

val df = spark.sql(<sql>)
df.coalesce(1).write.mode("overwrite").parquet(<s3path>)

Edit: I am now using CopyObjectRequest to change the storage class of the created Parquet files:

val copyObjectRequest = new CopyObjectRequest(bucket, key, bucket, key).withStorageClass(<storageClass>)
s3Client.copyObject(copyObjectRequest)
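
For reference, here is a fuller sketch of that copy-based workaround, assuming the AWS SDK for Java v1 and hypothetical bucket/prefix values: it lists the objects Spark wrote under the output path and copies each one onto itself with the new storage class (pagination beyond 1,000 keys is omitted for brevity).

import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{CopyObjectRequest, StorageClass}

val bucket = "my-bucket"                 // hypothetical bucket name
val prefix = "path/to/parquet-output/"   // hypothetical output prefix

val s3Client = AmazonS3ClientBuilder.defaultClient()

// Copy each object under the prefix onto itself, changing only its storage class.
s3Client.listObjectsV2(bucket, prefix).getObjectSummaries.asScala.foreach { summary =>
  val request = new CopyObjectRequest(bucket, summary.getKey, bucket, summary.getKey)
    .withStorageClass(StorageClass.StandardInfrequentAccess)
  s3Client.copyObject(request)
}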
tusher

1 Answer


As of July 2022 this has been implemented in the Hadoop source tree under HADOOP-12020 by AWS S3 engineers.

It is still stabilising and should be out in the next feature release of Hadoop 3.3.x, due in late 2022.

  • Anyone reading this before it ships: the code is there to build yourself.
  • Anyone reading this in 2023 or later: upgrade to Hadoop 3.3.5 or later (a configuration sketch follows below).
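
For anyone on a release that includes that work, here is a minimal sketch of how the original write could pick up the new storage class. It assumes Hadoop 3.3.5+ on the classpath; the property name fs.s3a.create.storage.class comes from HADOOP-12020 and should be verified against the S3A documentation for the exact release in use.

import org.apache.spark.sql.SparkSession

// Build the session with the S3A storage-class option (assumed name per HADOOP-12020);
// the "spark.hadoop." prefix forwards it into the Hadoop configuration.
val spark = SparkSession.builder()
  .appName("write-standard-ia")
  .config("spark.hadoop.fs.s3a.create.storage.class", "STANDARD_IA")
  .getOrCreate()

// Same write as in the question; objects created through S3A now use STANDARD_IA.
val df = spark.sql("<sql>")
df.coalesce(1).write.mode("overwrite").parquet("<s3path>")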
stevel
  • "Why not just define a lifecycle for the bucket and have things moved over every night?" - it's because you can move objects to OneZone AI only after 30 days. it makes a lot of sense to upload directly with OZ-IA – Vladimir Semashkin Apr 27 '22 at 06:39
  • Aah, that's a slightly different use case than Glacier. If there's a way to mark files in that category during upload, it'd be viable. As usual, a contributor to the OSS codebase is expected to add new tests and declare which endpoint they ran the current tests against.... – stevel Apr 27 '22 at 18:33
  • The difference between using a lifecycle policy and specifying the storage class in the request is cost: the former is priced at $0.01 per 1,000 requests (objects), whereas the latter is free – xdu Jul 28 '22 at 02:29
  • Updated the answer for 2022. You are free to check out Hadoop branch-3.3 and do your own release using the same doc I am using today to create 3.3.4 RC1 (which doesn't support this): https://cwiki.apache.org/confluence/display/HADOOP2/HowToRelease – stevel Jul 29 '22 at 14:50