I am working on a job that runs on EMR and saves thousands of partitions to S3. The partitioning scheme is year/month/day, and I have data from the last 50 years. When Spark writes the ~10,000 partitions using the s3a connector, it takes around an hour, which is extremely slow:
df.repartition($"year", $"month", $"day")
  .write
  .mode("append")
  .partitionBy("year", "month", "day")
  .parquet("s3a://mybucket/data")
Then I tried the same write with the s3 prefix, and it took only a few minutes to save all the partitions to S3:
df.repartition($"year", $"month", $"day")
  .write
  .mode("append")
  .partitionBy("year", "month", "day")
  .parquet("s3://mybucket/data")
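For reference, the timings above are rough wall-clock measurements; a sketch along these lines (reusing df and the bucket from above as placeholders) reproduces the comparison:

// Rough wall-clock timing of a partitioned write; path, df, and bucket are placeholders.
def timeWrite(path: String): Unit = {
  val t0 = System.nanoTime()
  df.repartition($"year", $"month", $"day")
    .write
    .mode("append")
    .partitionBy("year", "month", "day")
    .parquet(path)
  println(f"$path wrote in ${(System.nanoTime() - t0) / 1e9}%.1f s")
}

timeWrite("s3a://mybucket/data") // around an hour in my runs
timeWrite("s3://mybucket/data")  // a few minutes in my runs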
Likewise, when I overwrote 1,000 partitions, s3 was much faster than s3a:
df.repartition($"year", $"month", $"day")
  .write
  .option("partitionOverwriteMode", "dynamic")
  .mode("overwrite")
  .partitionBy("year", "month", "day")
  .parquet("s3://mybucket/data")
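(As an aside, the per-write option above also has a session-level equivalent, introduced in Spark 2.3 as far as I know:)

// Session-wide equivalent of the per-write "partitionOverwriteMode" option (Spark 2.3+).
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")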
As per my understanding, s3a is the more mature connector and the one in current use, while the s3/s3n connectors are old and deprecated. So what should I use here? Should I use s3? What is the best S3 connector or S3 URI to use for EMR jobs that save data to S3?
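For diagnosis, here is a minimal sketch (bucket name is a placeholder) that prints which Hadoop FileSystem implementation each URI scheme resolves to on the cluster; I would expect s3 on EMR to resolve to EMRFS rather than the old Apache Hadoop s3 connector:

import java.net.URI
import org.apache.hadoop.fs.FileSystem

// Print which FileSystem class each scheme resolves to on this cluster.
// On EMR I'd expect "s3" -> EMRFS and "s3a" -> org.apache.hadoop.fs.s3a.S3AFileSystem.
val hadoopConf = spark.sparkContext.hadoopConfiguration
Seq("s3://mybucket/", "s3a://mybucket/").foreach { uri =>
  val fs = FileSystem.get(new URI(uri), hadoopConf)
  println(s"$uri -> ${fs.getClass.getName}")
}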