EntityTooLarge error when uploading a 5G file to Amazon S3

Question

The Amazon S3 file size limit is supposed to be 5 TB according to this announcement, but I am getting the following error when uploading a 5 GB file:

'/mahler%2Fparquet%2Fpageview%2Fall-2014-2000%2F_temporary%2F_attempt_201410112050_0009_r_000221_2222%2Fpart-r-222.parquet' XML Error Message: 
  <?xml version="1.0" encoding="UTF-8"?>
  <Error>
    <Code>EntityTooLarge</Code>
    <Message>Your proposed upload exceeds the maximum allowed size</Message>
    <ProposedSize>5374138340</ProposedSize>
    ...
    <MaxSizeAllowed>5368709120</MaxSizeAllowed>
  </Error>

This makes it seem like S3 only accepts uploads of up to 5 GB. I am using Apache Spark SQL to write out a Parquet data set with the SchemaRDD.saveAsParquetFile method. The full stack trace is:

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/mahler%2Fparquet%2Fpageview%2Fall-2014-2000%2F_temporary%2F_attempt_201410112050_0009_r_000221_2222%2Fpart-r-222.parquet' XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>EntityTooLarge</Code><Message>Your proposed upload exceeds the maximum allowed size</Message><ProposedSize>5374138340</ProposedSize><RequestId>20A38B479FFED879</RequestId><HostId>KxeGsPreQ0hO7mm7DTcGLiN7vi7nqT3Z6p2Nbx1aLULSEzp6X5Iu8Kj6qM7Whm56ciJ7uDEeNn4=</HostId><MaxSizeAllowed>5368709120</MaxSizeAllowed></Error>
        org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:82)
        sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        java.lang.reflect.Method.invoke(Method.java:606)
        org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        org.apache.hadoop.fs.s3native.$Proxy10.storeFile(Unknown Source)
        org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.close(NativeS3FileSystem.java:174)
        org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
        org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
        parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:321)
        parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:111)
        parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
        org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:305)
        org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
        org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)

Is the upload limit still 5 TB? If it is, why am I getting this error, and how do I fix it?

For Python users: [Complete a multipart_upload with boto3?](https://stackoverflow.com/a/43788985/562769) — Martin Thoma, Sep 30 '19 at 08:46

score 25 · Accepted Answer · edited Jun 20 '20 at 09:12

The object size is limited to 5 TB. The upload size is still 5 GB, as explained in the manual:

Depending on the size of the data you are uploading, Amazon S3 offers the following options:

Upload objects in a single operation—With a single PUT operation you can upload objects up to 5 GB in size.

Upload objects in parts—Using the Multipart upload API you can upload large objects, up to 5 TB.

http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
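The numbers in the question's error message line up with the single-PUT limit. Here is a minimal sketch of that arithmetic, assuming the limits quoted above are binary units (5 GiB and 5 TiB). The function name `choose_upload_method` is made up for illustration and is not part of any AWS library.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

# Assumed limits, taken from the manual excerpt quoted above.
MAX_SINGLE_PUT = 5 * GIB    # 5368709120, same value as <MaxSizeAllowed>
MAX_OBJECT_SIZE = 5 * TIB   # largest object, reachable only via multipart


def choose_upload_method(size_bytes: int) -> str:
    """Return which upload path can handle an object of this size."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_bytes <= MAX_SINGLE_PUT:
        return "single PUT"
    if size_bytes <= MAX_OBJECT_SIZE:
        return "multipart"
    return "too large"


proposed = 5374138340  # <ProposedSize> from the question's error
print(choose_upload_method(proposed))   # multipart
print(proposed - MAX_SINGLE_PUT)        # 5429220 bytes over the PUT limit
```

So the file in the question is only about 5 MiB over the single-PUT limit, which is why a plain PUT is rejected even though the object itself is far below the object size limit.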

Once you complete a multipart upload, S3 validates and recombines the parts, and you then have a single object in S3, up to 5 TB in size, that can be downloaded as a single entity with a single HTTP GET request. Multipart uploading is also potentially much faster, even for files smaller than 5 GB, since you can upload the parts in parallel and retry any parts that didn't succeed on the first attempt.
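The parallel-and-retry idea can be sketched without touching S3 at all. The example below is only an illustration: `plan_parts`, `upload_with_retry`, `upload_all` and `fake_upload_part` are hypothetical names, the "upload" is simulated, and the 512 MiB part size is an arbitrary choice. For reference, S3 accepts up to 10,000 parts per upload, each between 5 MiB and 5 GiB except the last, which may be smaller.

```python
from concurrent.futures import ThreadPoolExecutor

MIB = 1024 ** 2


def plan_parts(total_size, part_size):
    """Split [0, total_size) into (part_number, offset, length) tuples."""
    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


def upload_with_retry(upload_part, part, attempts=3):
    """Call upload_part(part) until it succeeds or attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return upload_part(part)
        except OSError as exc:  # assumed type of a transient failure
            last_error = exc
    raise last_error


def upload_all(upload_part, parts, workers=4):
    """Upload parts in parallel; results come back in part order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: upload_with_retry(upload_part, p), parts))


failed_once = set()


def fake_upload_part(part):
    """Simulated upload: part 2 fails on its first attempt only."""
    number, _offset, _length = part
    if number == 2 and number not in failed_once:
        failed_once.add(number)
        raise OSError("simulated network error")
    return f"etag-{number}"


parts = plan_parts(5374138340, 512 * MIB)  # file size from the question
etags = upload_all(fake_upload_part, parts)
print(len(parts), parts[-1], etags[:3])
```

With a real client, each part's returned ETag would be passed to the final "complete multipart upload" request, which is the step where S3 stitches the parts into one object.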

score 11 · Answer 2 · answered Jun 11 '20 at 17:33

If you are using the AWS CLI for the upload, you can use the 'aws s3 cp' command. It switches to multipart upload automatically for large files, so you don't have to split the file yourself:

aws s3 cp masive-file.ova s3://<your-bucket>/<prefix>/masive-file.ova
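For Python, boto3's managed `upload_file` transfer behaves in a similar way: above its multipart threshold it splits the file and uploads the parts for you. The following is a rough sketch rather than a drop-in tool. The wrapper `upload_file_to_s3` and the `RecordingClient` stand-in are hypothetical, the bucket and key are placeholders, and boto3 has to be installed separately for a real upload.

```python
def upload_file_to_s3(path, bucket, key, client=None):
    """Upload a local file; boto3 picks single PUT or multipart by size."""
    if client is None:
        # boto3 is a third-party package; it is imported lazily so the
        # function can also be exercised with a stand-in client.
        import boto3
        client = boto3.client("s3")
    # upload_file is boto3's managed transfer, which handles multipart
    # uploads for large files without any manual splitting.
    client.upload_file(path, bucket, key)
    return f"s3://{bucket}/{key}"


class RecordingClient:
    """Hypothetical stand-in that records calls instead of talking to S3."""

    def __init__(self):
        self.calls = []

    def upload_file(self, filename, bucket, key):
        self.calls.append((filename, bucket, key))


fake = RecordingClient()
url = upload_file_to_s3(
    "masive-file.ova", "your-bucket", "prefix/masive-file.ova", client=fake
)
print(url)
```

Calling `upload_file_to_s3(path, bucket, key)` without the `client` argument would use a real S3 client with whatever credentials boto3 finds in the environment.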

Tomasz Swider

Sometimes all you need is a simple command. – VIPIN KUMAR Jan 21 '21 at 19:53

score 4 · Answer 3 · answered Feb 22 '16 at 20:40

The trick usually seems to be figuring out how to get the client to do a multipart upload. For copying data from HDFS to S3, this can be done by using the s3n filesystem and explicitly enabling multipart uploads with fs.s3n.multipart.uploads.enabled=true.

This can be done like:

hdfs dfs -Dfs.s3n.awsAccessKeyId=ACCESS_KEY -Dfs.s3n.awsSecretAccessKey=SUPER_SECRET_KEY -Dfs.s3n.multipart.uploads.enabled=true -cp hdfs:///path/to/source/data s3n://bucket/folder/

And further configuration can be found here: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
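If the copy is launched from a script, the same command can be assembled programmatically so the keys are not hardcoded. This is a small sketch under stated assumptions: `build_hdfs_copy_command` is a hypothetical helper, and reading the keys from the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables is a choice made here, not something the answer prescribes. The flags themselves are the ones shown in the command above.

```python
import os


def build_hdfs_copy_command(source, destination, access_key=None, secret_key=None):
    """Build the argv list for the hdfs copy with s3n multipart enabled."""
    # Assumption: fall back to the conventional AWS environment variables.
    access_key = access_key or os.environ["AWS_ACCESS_KEY_ID"]
    secret_key = secret_key or os.environ["AWS_SECRET_ACCESS_KEY"]
    return [
        "hdfs", "dfs",
        f"-Dfs.s3n.awsAccessKeyId={access_key}",
        f"-Dfs.s3n.awsSecretAccessKey={secret_key}",
        "-Dfs.s3n.multipart.uploads.enabled=true",
        "-cp", source, destination,
    ]


argv = build_hdfs_copy_command(
    "hdfs:///path/to/source/data",
    "s3n://bucket/folder/",
    access_key="ACCESS_KEY",        # placeholder values from the answer
    secret_key="SUPER_SECRET_KEY",
)
print(argv[4])  # -Dfs.s3n.multipart.uploads.enabled=true
# To actually run the copy: subprocess.run(argv, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting problems, although the keys are still visible in the process arguments while the command runs.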

Sean

Glad to hear it! – Sean Mar 25 '19 at 15:00
