How can I make Apache Spark use multipart uploads when saving data to Amazon S3? Spark writes data using the RDD.saveAs...File methods. When the destination starts with s3n://, Spark automatically uses JetS3t to do the upload, but this fails for files larger than 5 GB. Large files need to be uploaded to S3 using multipart upload, which is supposed to be beneficial for smaller files as well. Multipart uploads are supported in JetS3t with MultipartUtils, but Spark does not use this in the default configuration. Is there a way to make it use this functionality?
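To make the size limit concrete, here is a minimal sketch of the arithmetic behind a multipart upload. It is not Spark or JetS3t code. The function names are hypothetical, the 5 GB limit is treated as 5 GiB, and the 64 MiB part size is only an illustrative choice; the part-size bounds and the 10,000-part cap are the limits S3 documents for multipart uploads.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

SINGLE_PUT_LIMIT = 5 * GIB  # assumed: largest object one PUT request can carry
MIN_PART_SIZE = 5 * MIB     # smallest allowed part (the last part may be smaller)
MAX_PART_SIZE = 5 * GIB     # largest allowed part
MAX_PARTS = 10_000          # most parts a single multipart upload may have


def needs_multipart(file_size):
    """Return True when a file is too large for a single PUT."""
    return file_size > SINGLE_PUT_LIMIT


def part_count(file_size, part_size=64 * MIB):
    """Return how many parts a multipart upload of file_size bytes would use."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if not MIN_PART_SIZE <= part_size <= MAX_PART_SIZE:
        raise ValueError("part_size is outside the allowed range")
    # Integer ceiling division, so the last part holds the remainder.
    parts = (file_size + part_size - 1) // part_size
    if parts > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return parts


if __name__ == "__main__":
    size = 6 * GIB
    print(needs_multipart(size), part_count(size))
```

A 6 GiB output file is over the single-request limit, so it has to go up in parts; with 64 MiB parts that is 96 requests, each of which can be retried on its own.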
Viewed 3,118 times
5

Daniel Mahler
2 Answers
0
s3n seems to be on a deprecation path.
From the Amazon EMR documentation:
Amazon EMR used the S3 Native FileSystem with the URI scheme, s3n. While this still works, we recommend that you use the s3 URI scheme for the best performance, security, and reliability
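
If you follow that recommendation, the change on the Spark side is only the URI scheme of the output path. A minimal sketch, assuming you are running on Amazon EMR as the quoted documentation does; the helper name, the bucket, and the `rdd` variable are hypothetical, and outside EMR the s3 scheme may be backed by a different filesystem implementation.

```python
def prefer_s3_scheme(uri):
    """Rewrite an s3n:// URI to use the s3:// scheme; leave other URIs alone."""
    prefix = "s3n://"
    if uri.startswith(prefix):
        return "s3://" + uri[len(prefix):]
    return uri


if __name__ == "__main__":
    out = prefer_s3_scheme("s3n://my-bucket/output/run1")
    print(out)
    # Hypothetical use from PySpark, where `rdd` is an existing RDD:
    # rdd.saveAsTextFile(out)
```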

rishi