
I am saving a Spark DataFrame to an S3 bucket. The default storage class for the saved files is STANDARD, but I need it to be STANDARD_IA. What is the option to achieve this? I have looked through the Spark source code and found no such option for DataFrameWriter in https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala

Below is the code I am using to write to S3:

val df = spark.sql(<sql>)
df.coalesce(1).write.mode("overwrite").parquet(<s3path>)

Edit: I am now using CopyObjectRequest to change the storage class of the created Parquet files:

val copyObjectRequest = new CopyObjectRequest(bucket, key, bucket, key).withStorageClass(<storageClass>)
s3Client.copyObject(copyObjectRequest)
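
For reference, here is a fuller sketch of that copy-based workaround, assuming the AWS SDK for Java v1 and hypothetical bucket/prefix values: it lists the objects Spark wrote under the output path and copies each one onto itself with the new storage class (pagination beyond 1,000 keys is omitted for brevity).

import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{CopyObjectRequest, StorageClass}

val bucket = "my-bucket"                 // hypothetical bucket name
val prefix = "path/to/parquet-output/"   // hypothetical output prefix

val s3Client = AmazonS3ClientBuilder.defaultClient()

// Copy each object under the prefix onto itself, changing only its storage class.
s3Client.listObjectsV2(bucket, prefix).getObjectSummaries.asScala.foreach { summary =>
  val request = new CopyObjectRequest(bucket, summary.getKey, bucket, summary.getKey)
    .withStorageClass(StorageClass.StandardInfrequentAccess)
  s3Client.copyObject(request)
}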
tusher

1 Answer


As of July 2022 this has been implemented in the Hadoop source tree under HADOOP-12020 by AWS S3 engineers.

It is still stabilising and should be out in the next feature release of Hadoop 3.3.x, due in late 2022.

  • Anyone reading this before it ships: the code is there to build yourself.
  • Anyone reading this in 2023 or later: upgrade to Hadoop 3.3.5 or later (a configuration sketch follows below).
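
For anyone on a release that includes that work, here is a minimal sketch of how the original write could pick up the new storage class. It assumes Hadoop 3.3.5+ on the classpath; the property name fs.s3a.create.storage.class comes from HADOOP-12020 and should be verified against the S3A documentation for the exact release in use.

import org.apache.spark.sql.SparkSession

// Build the session with the S3A storage-class option (assumed name per HADOOP-12020);
// the "spark.hadoop." prefix forwards it into the Hadoop configuration.
val spark = SparkSession.builder()
  .appName("write-standard-ia")
  .config("spark.hadoop.fs.s3a.create.storage.class", "STANDARD_IA")
  .getOrCreate()

// Same write as in the question; objects created through S3A now use STANDARD_IA.
val df = spark.sql("<sql>")
df.coalesce(1).write.mode("overwrite").parquet("<s3path>")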
stevel
  • "Why not just define a lifecycle for the bucket and have things moved over every night?" - it's because you can move objects to OneZone AI only after 30 days. it makes a lot of sense to upload directly with OZ-IA – Vladimir Semashkin Apr 27 '22 at 06:39
  • Aah, that's a slightly different use case than Glacier. If there's a way to mark files in that category during upload, it'd be viable. As usual, a contributor to the OSS codebase is expected to add new tests and declare which endpoint they ran the current tests against.... – stevel Apr 27 '22 at 18:33
  • The difference between using a lifecycle policy and specifying the storage class in the request is cost: the former is priced at $0.01 per 1,000 requests (objects), whereas the latter is free – xdu Jul 28 '22 at 02:29
  • Updated the answer for 2022. You are free to check out Hadoop branch-3.3 and do your own release using the same doc I am using today to create 3.3.4 RC1 (which doesn't support this): https://cwiki.apache.org/confluence/display/HADOOP2/HowToRelease – stevel Jul 29 '22 at 14:50