74

I need to create a CSV and upload it to an S3 bucket. Since I'm creating the file on the fly, it would be better if I could write it directly to the S3 bucket as it is being created, rather than writing the whole file locally and then uploading it at the end.

Is there a way to do this? My project is in Python and I'm fairly new to the language. Here is what I tried so far:

import csv
import io
import boto
from boto.s3.key import Key


conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'

fieldnames = ['first_name', 'last_name']
writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
k.set_contents_from_stream(writer.writeheader())

I received this error: BotoClientError: s3 does not support chunked transfer

UPDATE: I found a way to write directly to S3, but I can't find a way to clear the buffer without actually deleting the lines I already wrote. So, for example:

conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'

testDict = [{
    "fieldA": "8",
    "fieldB": None,
    "fieldC": "888888888888"},
    {
    "fieldA": "9",
    "fieldB": None,
    "fieldC": "99999999999"}]

f = io.StringIO()
fieldnames = ['fieldA', 'fieldB', 'fieldC']
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
k.set_contents_from_string(f.getvalue())

for row in testDict:
    writer.writerow(row)
    k.set_contents_from_string(f.getvalue())

f.close()

This writes 3 lines to the file; however, I'm unable to release memory in order to write a big file. If I add:

f.seek(0)
f.truncate(0)

to the loop, then only the last line of the file is written. Is there any way to release resources without deleting lines from the file?

inquiring minds
  • Even if you could write to S3 like you want, I would not recommend it due to consistency challenges. Why do you think it would be better to not write locally? Would you want a partial S3 object if there was an exception or issue? I presume not. – cgseller Jun 24 '15 at 21:30
  • I was looking to write directly to be a little more efficient. Essentially, if I write the file locally and then upload it, I'm adding the upload as an additional step, plus cleanup of the local file. I don't mind having an incomplete file - I could have an incomplete file if I wrote it locally too. The system will be idempotent and will either delete a file in an error state or continue it. – inquiring minds Jun 25 '15 at 15:24

7 Answers

54

I did find a solution to my question, which I will post here in case anyone else is interested. I decided to do this in parts with a multipart upload; you can't stream to S3 directly, but you can send the file up in chunks as they are produced. There is also a package available, Smart Open, that turns your streaming file into a multipart upload, which is what I used.

import smart_open
import io
import csv

testDict = [{
    "fieldA": "8",
    "fieldB": None,
    "fieldC": "888888888888"},
    {
    "fieldA": "9",
    "fieldB": None,
    "fieldC": "99999999999"}]

fieldnames = ['fieldA', 'fieldB', 'fieldC']
f = io.StringIO()
with smart_open.smart_open('s3://dev-test/bar/foo.csv', 'wb') as fout:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    fout.write(f.getvalue())

    for row in testDict:
        # Reuse the same StringIO as a small scratch buffer: rewind and
        # truncate it before each row so memory use stays flat, then push
        # just that row's CSV text into the multipart upload.
        f.seek(0)
        f.truncate(0)
        writer.writerow(row)
        fout.write(f.getvalue())

f.close()
inquiring minds
  • For Python 2, be sure to use `StringIO.StringIO()` instead of `io.StringIO()`, else you will receive an encoding error – Anconia Aug 15 '16 at 21:17
  • @inquiring minds, that's a good answer. My use case is almost like yours, only the difference is rather than csv, I want to generate an XML. As I like to use templating options like Mako/ genshi for xml-generation, can u suggest me a way how to deal with it? (Generating and writing simultanously, rather than local write first) – Ahsanul Haque Dec 12 '18 at 07:49
  • Binary mode is not needed for StringIO, changing mode from 'wb' to 'w' fixed the issue for me. – Debodirno Chandra Feb 07 '23 at 05:13
1

According to the docs, it's possible:

s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))

so we can use a StringIO object in the ordinary way.
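For example, here is a minimal sketch of that idea, using placeholder bucket/key names similar to the example above and assuming boto3 credentials are already configured. Note that the whole CSV is built in memory and uploaded with a single put() call, so this is buffering rather than true streaming:

import csv
import io

import boto3

s3 = boto3.resource('s3')

# Build the CSV in an in-memory text buffer instead of a local file
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['fieldA', 'fieldB', 'fieldC'])
writer.writeheader()
writer.writerow({'fieldA': '8', 'fieldB': None, 'fieldC': '888888888888'})

# 'mybucket' / 'hello.csv' are placeholders; replace with your own bucket and key
s3.Object('mybucket', 'hello.csv').put(Body=buf.getvalue().encode('utf-8'))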

Update: the smart_open lib from @inquiring minds' answer is a better solution

dirkjot
El Ruso
  • I don't understand how to use this. Isn't /tmp/hello.txt a local file, which is what we're trying to avoid? – EthanP Jul 28 '16 at 20:06
  • @EthanP [StringIO](https://docs.python.org/2/library/stringio.html) — Read and write strings as files. Use `StringIO` object instead of file – El Ruso Jul 29 '16 at 10:45
  • No, according to [this ticket](https://github.com/boto/boto3/issues/256), it is not supported. The idea of using streams with S3 is to avoid using static files when you need to upload huge files of several gigabytes. I am trying to solve this issue as well - I need to read a large amount of data from mongodb and put it to S3, and I don't want to use files. – baldr Jul 29 '16 at 16:09
  • @baldr Hmmm. This trick worked for me in the past. By the way, in the ticket mentioned in your message I see another [useful](https://github.com/boto/boto3/issues/256#issuecomment-139609091) method. Unfortunately I don't work with Amazon now and can't test it – El Ruso Aug 02 '16 at 11:47
  • I tried to dig into the `boto` sources and I see it needs to calculate an MD5 checksum for each file sent. This means that the stream should be at least 'seekable'. I have a non-seekable stream, as I read from mongodb and cannot rewind the data flow easily. The `smart_open` recommended here allows streams, but it just uses an internal buffer and then a 'multipart upload' with `boto` anyway. Technically it is possible to use file-like streams, but be ready for it to require a lot of memory. The idea of a stream is to use low memory to upload a (probably) endless data flow. – baldr Aug 02 '16 at 19:09
  • @baldr However, it looks like S3 in general [can](http://stackoverflow.com/q/8653146/4249707) work with this kind of file. – El Ruso Aug 03 '16 at 11:12
  • @el-ruso, yes, this is exactly how `smart_open` works, and it seems to be the only way to upload these files. You upload a large file in smaller chunks. I wouldn't call it 'stream upload', just 'chunk upload'. – baldr Aug 03 '16 at 18:38
1

We were trying to upload file contents to S3 when they came through as an InMemoryUploadedFile object in a Django request. We ended up doing the following because we didn't want to save the file locally. Hope it helps:

@action(detail=False, methods=['post'])
def upload_document(self, request):
    # `s3` is assumed to be an already-initialized boto3 S3 client
    document = request.data.get('image').file
    s3.upload_fileobj(document, BUCKET_NAME,
                      DESIRED_NAME_OF_FILE_IN_S3,
                      ExtraArgs={"ServerSideEncryption": "aws:kms"})
  • While this approach works, it does not imply streaming - as InMemoryUploadedFile keeps the whole file in RAM. In-memory files are relatively small in size - and they're not generated on-the-fly. – Eugene Jan 20 '22 at 12:34
1

Here is a complete example using boto3:

import boto3
import io

session = boto3.Session(
    aws_access_key_id="...",
    aws_secret_access_key="..."
)

s3 = session.resource("s3")

buff = io.BytesIO()

buff.write("test1\n".encode())
buff.write("test2\n".encode())

# `bucket` and `keypath` are placeholders for your bucket name and object key
s3.Object(bucket, keypath).put(Body=buff.getvalue())
Scott
  • I downvoted because buff.getvalue() is clearly not a stream, but a `bytes` object https://docs.python.org/3/library/io.html#io.BytesIO.getvalue – mdurant Jul 08 '23 at 16:55
1

There's a well-supported library for doing just this:

pip install s3fs

s3fs is really trivial to use:

import s3fs

s3 = s3fs.S3FileSystem(anon=False)

with s3.open('mybucket/new-file', 'wb') as f:
    f.write(2*2**20 * b'a')
    f.write(2*2**20 * b'a')

Incidentally, there's also something built into boto3 (backed by the AWS API) called MultiPartUpload.

This isn't factored as a Python stream, which might be an advantage for some people. Instead, you can start an upload and send parts one at a time, as sketched below.
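For those curious, here is a rough sketch of the low-level boto3 client calls involved, using hypothetical bucket/key names and a toy chunk list. Treat it as an illustration of the flow (create, upload parts, complete) rather than production code:

import boto3

s3 = boto3.client('s3')
bucket, key = 'mybucket', 'big-file.bin'   # placeholder names

# Each part except the last must be at least 5 MB
chunks = [b'a' * (5 * 1024 * 1024), b'the final part can be smaller']

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for part_number, chunk in enumerate(chunks, start=1):
    resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                          PartNumber=part_number, Body=chunk)
    parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})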

Philip Couling
0

There's an interesting code solution mentioned in a GitHub smart_open issue (#82) that I've been meaning to try out. Copy-pasting here for posterity... looks like boto3 is required:

import csv
import gzip
import io

import boto3

# `my_data` is an iterable of rows (e.g. a list of lists); `bucket_name` and
# `key` are your target bucket and object key
csv_data = io.StringIO()          # csv.writer needs a text stream in Python 3
writer = csv.writer(csv_data)
writer.writerows(my_data)

gz_stream = io.BytesIO()
with gzip.GzipFile(fileobj=gz_stream, mode="w") as gz:
    gz.write(csv_data.getvalue().encode("utf-8"))   # gzip wants bytes
gz_stream.seek(0)

s3 = boto3.client('s3')
s3.upload_fileobj(gz_stream, bucket_name, key)

This specific example is streaming to a compressed S3 key/file, but it seems like the general approach -- using the boto3 S3 client's upload_fileobj() method in conjunction with a target stream, not a file -- should work.
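If compression isn't needed, the same idea works with just the in-memory buffer; here is a minimal sketch with placeholder bucket/key names (upload_fileobj reads from any seekable, binary file-like object):

import io

import boto3

s3 = boto3.client('s3')

# Any in-memory, binary file-like object will do; the names below are placeholders
body = io.BytesIO(b"col1,col2\n1,2\n")
s3.upload_fileobj(body, "mybucket", "data/foo.csv")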

Mass Dot Net
  • can you explain what is my_data here ?? is it a list or dict?? – User1011 Feb 08 '22 at 06:28
  • According to this StackOverflow answer, `writer.writerows()` takes an iterable of iterables -- list of lists, array of arrays, etc -- as input: https://stackoverflow.com/a/33092057/165494 – Mass Dot Net Feb 13 '22 at 02:50
-4

To write a string to an S3 object, use:

s3.Object('my_bucket', 'my_file.txt').put(Body='Hello there')

So convert the stream to string and you're there.
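For instance, a minimal sketch of that conversion, reusing the bucket/key names above and encoding to bytes to be safe with the Body parameter:

import io

import boto3

s3 = boto3.resource('s3')

buf = io.StringIO()
buf.write('Hello there')

# getvalue() collapses the in-memory stream into one string, uploaded in a single request
s3.Object('my_bucket', 'my_file.txt').put(Body=buf.getvalue().encode('utf-8'))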

Sam