I am working on a job that runs on EMR and saves thousands of partitions to S3. The partitioning scheme is year/month/day, and I have data from the last 50 years. When Spark writes the ~10,000 partitions using the s3a connector, it takes around an hour, which is extremely slow:
df.repartition($"year", $"month", $"day")
  .write
  .mode("append")
  .partitionBy("year", "month", "day")
  .parquet("s3a://mybucket/data")
Then I tried the same write with the s3 prefix, and it took only a few minutes to save all the partitions to S3:
df.repartition($"year", $"month", $"day")
  .write
  .mode("append")
  .partitionBy("year", "month", "day")
  .parquet("s3://mybucket/data")
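For reference, the timings above are rough wall-clock measurements; a sketch along these lines (reusing df and the bucket from above as placeholders) reproduces the comparison:

// Rough wall-clock timing of a partitioned write; path, df, and bucket are placeholders.
def timeWrite(path: String): Unit = {
  val t0 = System.nanoTime()
  df.repartition($"year", $"month", $"day")
    .write
    .mode("append")
    .partitionBy("year", "month", "day")
    .parquet(path)
  println(f"$path wrote in ${(System.nanoTime() - t0) / 1e9}%.1f s")
}

timeWrite("s3a://mybucket/data") // around an hour in my runs
timeWrite("s3://mybucket/data")  // a few minutes in my runs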
Likewise, when I overwrote 1,000 partitions, s3 was much faster than s3a:
df.repartition($"year", $"month", $"day")
  .write
  .option("partitionOverwriteMode", "dynamic")
  .mode("overwrite")
  .partitionBy("year", "month", "day")
  .parquet("s3://mybucket/data")
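(As an aside, the per-write option above also has a session-level equivalent, introduced in Spark 2.3 as far as I know:)

// Session-wide equivalent of the per-write "partitionOverwriteMode" option (Spark 2.3+).
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")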
As per my understanding, s3a is the more mature connector and the one in current use, while the s3/s3n connectors are old and deprecated. So what should I use here? Should I use s3? What is the best S3 connector or S3 URI to use for EMR jobs that save data to S3?
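For diagnosis, here is a minimal sketch (bucket name is a placeholder) that prints which Hadoop FileSystem implementation each URI scheme resolves to on the cluster; I would expect s3 on EMR to resolve to EMRFS rather than the old Apache Hadoop s3 connector:

import java.net.URI
import org.apache.hadoop.fs.FileSystem

// Print which FileSystem class each scheme resolves to on this cluster.
// On EMR I'd expect "s3" -> EMRFS and "s3a" -> org.apache.hadoop.fs.s3a.S3AFileSystem.
val hadoopConf = spark.sparkContext.hadoopConfiguration
Seq("s3://mybucket/", "s3a://mybucket/").foreach { uri =>
  val fs = FileSystem.get(new URI(uri), hadoopConf)
  println(s"$uri -> ${fs.getClass.getName}")
}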