15

Amazon S3 file size limit is supposed to be 5T according to this announcement, but I am getting the following error when uploading a 5G file

'/mahler%2Fparquet%2Fpageview%2Fall-2014-2000%2F_temporary%2F_attempt_201410112050_0009_r_000221_2222%2Fpart-r-222.parquet' XML Error Message: 
  <?xml version="1.0" encoding="UTF-8"?>
  <Error>
    <Code>EntityTooLarge</Code>
    <Message>Your proposed upload exceeds the maximum allowed size</Message>
    <ProposedSize>5374138340</ProposedSize>
    ...
    <MaxSizeAllowed>5368709120</MaxSizeAllowed>
  </Error>

This makes it seem like S3 is only accepting 5G uploads. I am using Apache Spark SQL to write out a Parquet data set using SchemRDD.saveAsParquetFile method. The full stack trace is

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/mahler%2Fparquet%2Fpageview%2Fall-2014-2000%2F_temporary%2F_attempt_201410112050_0009_r_000221_2222%2Fpart-r-222.parquet' XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>EntityTooLarge</Code><Message>Your proposed upload exceeds the maximum allowed size</Message><ProposedSize>5374138340</ProposedSize><RequestId>20A38B479FFED879</RequestId><HostId>KxeGsPreQ0hO7mm7DTcGLiN7vi7nqT3Z6p2Nbx1aLULSEzp6X5Iu8Kj6qM7Whm56ciJ7uDEeNn4=</HostId><MaxSizeAllowed>5368709120</MaxSizeAllowed></Error>
        org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:82)
        sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        java.lang.reflect.Method.invoke(Method.java:606)
        org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        org.apache.hadoop.fs.s3native.$Proxy10.storeFile(Unknown Source)
        org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.close(NativeS3FileSystem.java:174)
        org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
        org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
        parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:321)
        parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:111)
        parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
        org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:305)
        org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
        org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)

Is the upload limit still 5T? If it is why am I getting this error and how do I fix it?

Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90

3 Answers3

25

The object size is limited to 5 TB. The upload size is still 5 GB, as explained in the manual:

Depending on the size of the data you are uploading, Amazon S3 offers the following options:

  • Upload objects in a single operation—With a single PUT operation you can upload objects up to 5 GB in size.

  • Upload objects in parts—Using the Multipart upload API you can upload large objects, up to 5 TB.

http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html

Once you do a multipart upload, S3 validates and recombines the parts, and you then have a single object in S3, up to 5TB in size, that can be downloaded as a single entitity, with a single HTTP GET request... but uploading is potentially much faster, even on files smaller than 5GB, since you can upload the parts in parallel and even retry the uploads of any parts that didn't succeed on first attempt.

Community
  • 1
  • 1
Michael - sqlbot
  • 169,571
  • 25
  • 353
  • 427
11

If you are using aws cli for the upload, you can use 'aws s3 cp' command so it does not require splitting and multi part upload

aws s3 cp masive-file.ova s3://<your-bucket>/<prefix>/masive-file.ova
Tomasz Swider
  • 2,314
  • 18
  • 22
4

The trick usually seems to be figuring out how to tell S3 to do a multipart upload. For copying data from HDFS to S3, this can be done by using the s3n filesystem and specifically enabling multipart uploads with fs.s3n.multipart.uploads.enabled=true

This can be done like:

hdfs dfs -Dfs.s3n.awsAccessKeyId=ACCESS_KEY -Dfs.s3n.awsSecretAccessKey=SUPER_SECRET_KEY -Dfs.s3n.multipart.uploads.enabled=true -cp hdfs:///path/to/source/data s3n://bucket/folder/

And further configuration can be found here: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html

Sean
  • 2,315
  • 20
  • 25
  • Glad to hear it! – Sean Mar 25 '19 at 15:00
  • https://stackoverflow.com/questions/55427694/why-sqoop-job-is-not-creating-dynamic-sub-directory-date-wise Can anybody please help me regarding this ? – Raj Mar 30 '19 at 02:47