95

I'm working on a machine with limited memory, and I'd like to upload a dynamically generated (not-from-disk) file to S3 in a streaming manner. In other words, I don't know the file size when I start the upload, but I'll know it by the end. Normally a PUT request has a Content-Length header, but perhaps there is a way around this, such as using multipart uploads or chunked transfer encoding.

S3 can support streaming uploads. For example, see here:

http://blog.odonnell.nu/posts/streaming-uploads-s3-python-and-poster/

My question is, can I accomplish the same thing without having to specify the file length at the start of the upload?

Tyler
  • The [smart_open](https://github.com/piskvorky/smart_open) Python library does that for you (streamed read and write). – Radim Jan 26 '15 at 08:43
  • 10 years later & the AWS S3 SDKs *still* don't have a managed way to do this - as someone who is hugely invested in the AWS ecosystem, it's very disappointing to see this in comparison to object management SDKs provided by other cloud providers. This is a core feature missing. – Ermiya Eskandary Mar 14 '22 at 14:42
  • @ErmiyaEskandary Actually the Go SDK has it, but both v1 and v2 have memory leak issues with the multipart upload method (uploader.Upload). – Nikolay Dimitrov Aug 31 '23 at 01:21

6 Answers

92

You have to upload your file in 5MiB+ chunks via S3's multipart upload API. Each of those chunks requires a Content-Length, but you can avoid loading huge amounts of data (100MiB+) into memory.

  • Initiate S3 Multipart Upload.
  • Gather data into a buffer until that buffer reaches S3's lower chunk-size limit (5MiB). Generate MD5 checksum while building up the buffer.
  • Upload that buffer as a Part, store the ETag (read the docs on that one).
  • Once you reach EOF of your data, upload the last chunk (which can be smaller than 5MiB).
  • Finalize the Multipart Upload.
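
A minimal sketch of these steps in Python, assuming boto3's low-level client (create_multipart_upload / upload_part / complete_multipart_upload); the bucket, key and data source are placeholders, and the optional MD5 step is omitted for brevity:

import boto3

MIN_PART_SIZE = 5 * 1024 * 1024  # S3's minimum part size (except for the last part)


def stream_to_s3(data_source, bucket, key):
    """Upload an iterable of bytes of unknown total length via a multipart upload."""
    s3 = boto3.client("s3")
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts, buffer, part_number = [], bytearray(), 1

    def flush(data, number):
        # Upload one buffered part and remember its ETag for the final call.
        etag = s3.upload_part(
            Bucket=bucket, Key=key, UploadId=upload_id,
            PartNumber=number, Body=data,
        )["ETag"]
        parts.append({"PartNumber": number, "ETag": etag})

    try:
        for chunk in data_source:
            buffer.extend(chunk)
            while len(buffer) >= MIN_PART_SIZE:                    # gather until 5MiB
                flush(bytes(buffer[:MIN_PART_SIZE]), part_number)  # upload as a Part
                del buffer[:MIN_PART_SIZE]
                part_number += 1
        if buffer:                                   # last part may be smaller than 5MiB
            flush(bytes(buffer), part_number)
        s3.complete_multipart_upload(                # finalize the multipart upload
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so the unfinished parts don't keep accruing storage charges.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise


# Example usage with a placeholder generator of unknown total length:
# stream_to_s3((f"row {i}\n".encode() for i in range(10_000_000)),
#              "my-bucket", "streamed-object")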

S3 allows up to 10,000 parts. So by choosing a part-size of 5MiB you will be able to upload dynamic files of up to 50GiB. Should be enough for most use-cases.

However: If you need more, you have to increase your part-size, either by using a higher fixed part-size (10MiB for example) or by increasing it during the upload, for example:

First 25 parts:   5MiB (total:  125MiB)
Next 25 parts:   10MiB (total:  375MiB)
Next 25 parts:   25MiB (total:    1GiB)
Next 25 parts:   50MiB (total: 2.25GiB)
After that:     100MiB

This will allow you to upload files of up to 1TB (S3's limit for a single file is 5TB right now) without wasting memory unnecessarily.
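
One way to express such an escalating schedule in code (a sketch; the breakpoints simply mirror the example table above):

def part_size(part_number):
    """Escalating part-size schedule from the table above (part numbers are 1-based)."""
    MiB = 1024 * 1024
    if part_number <= 25:
        return 5 * MiB
    if part_number <= 50:
        return 10 * MiB
    if part_number <= 75:
        return 25 * MiB
    if part_number <= 100:
        return 50 * MiB
    return 100 * MiB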


A note on your link to Sean O'Donnell's blog:

His problem is different from yours - he knows and uses the Content-Length before the upload. He wants to improve on this situation: Many libraries handle uploads by loading all data from a file into memory. In pseudo-code that would be something like this:

data = File.read(file_name)
request = new S3::PutFileRequest()
request.setHeader('Content-Length', data.size)
request.setBody(data)
request.send()

His solution does it by getting the Content-Length via the filesystem-API. He then streams the data from disk into the request-stream. In pseudo-code:

upload = new S3::PutFileRequestStream()
upload.writeHeader('Content-Length', File.getSize(file_name))
upload.flushHeader()

input = File.open(file_name, File::READONLY_FLAG)

while (data = input.read())
  upload.write(data)
end

upload.flush()
upload.close()
Marcel Jackwerth
  • A Java implementation of this in the form of an OutputStream exists in s3distcp: https://github.com/libin/s3distcp/blob/master/src/main/java/com/amazon/external/elasticmapreduce/s3distcp/MultipartUploadOutputStream.java – sigget Dec 02 '14 at 23:11
  • I've created an open source library dedicated to this at https://github.com/alexmojaki/s3-stream-upload – Alex Hall Oct 22 '15 at 14:13
  • Where did you find the 5MiB limit? – Landon Kuhn Jan 18 '17 at 20:36
  • Looks like you can also use the CLI now with a pipe - https://github.com/aws/aws-cli/pull/903 – chrismarx Oct 30 '18 at 15:32
  • @AlexHall Any Python implementation? – Tushar Kolhe May 09 '20 at 10:03
  • @TusharKolhe Googling "python stream multipart upload s3" I found https://stackoverflow.com/questions/31031463/can-you-upload-to-s3-using-a-stream-rather-than-a-local-file and https://stackoverflow.com/questions/52825430/stream-large-string-to-s3-using-boto3, and it looks like there were more results – Alex Hall May 09 '20 at 10:08
  • @AlexHall Thanks, I figured out a way. This is the actual problem I'm trying to solve: https://stackoverflow.com/questions/61696155/python-boto3-multipart-upload-video-to-aws-s3. With a file already on disk I'm able to do this, but I want to upload streaming frames. – Tushar Kolhe May 09 '20 at 12:24
9

Putting this answer here for others in case it helps:

If you don't know the length of the data you are streaming up to S3, you can use the S3FileInfo class (from the AWS SDK for .NET) and its OpenWrite() method to write arbitrary data into S3.

var fileInfo = new S3FileInfo(amazonS3Client, "MyBucket", "streamed-file.txt");

using (var outputStream = fileInfo.OpenWrite())
{
    using (var streamWriter = new StreamWriter(outputStream))
    {
        streamWriter.WriteLine("Hello world");
        // You can do as many writes as you want here
    }
}
mwrichardson
7

You can use the gof3r command-line tool to stream data through Linux pipes:

$ tar -czf - <my_dir/> | gof3r put --bucket <s3_bucket> --key <s3_object>
webwurst
  • is there a way to just do `tar -czf - | aws s3 --something-or-other` ? –  Aug 01 '19 at 23:07
2

If you are using Node.js, you can use a plugin like s3-streaming-upload to accomplish this quite easily.

nathanpeck
1

Reference: https://github.com/aws/aws-cli/pull/903

Here is a synopsis. To upload a stream from stdin to S3, use:

aws s3 cp - s3://my-bucket/stream

To download an S3 object as a stdout stream, use:

aws s3 cp s3://my-bucket/stream -

So, for example, if I had the object s3://my-bucket/stream, I could run:

aws s3 cp s3://my-bucket/stream - | aws s3 cp - s3://my-bucket/new-stream

My command:

echo "ccc" | aws --endpoint-url=http://172.22.222.245:80 --no-verify-ssl s3 cp - s3://test-bucket/ccc

Drawn Yang
1

Read up on HTTP multipart entity requests. You can send a file as chunks of data to the target.

Kris