74

I need to create a CSV and upload it to an S3 bucket. Since I'm creating the file on the fly, it would be better if I could write it directly to the S3 bucket as it is being created, rather than writing the whole file locally and then uploading it at the end.

Is there a way to do this? My project is in Python and I'm fairly new to the language. Here is what I tried so far:

import csv
import io
import boto
from boto.s3.key import Key


conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'

fieldnames = ['first_name', 'last_name']
writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
k.set_contents_from_stream(writer.writeheader())

I received this error: BotoClientError: s3 does not support chunked transfer

UPDATE: I found a way to write directly to S3, but I can't find a way to clear the buffer without actually deleting the lines I already wrote. So, for example:

conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'

testDict = [{
    "fieldA": "8",
    "fieldB": None,
    "fieldC": "888888888888"},
    {
    "fieldA": "9",
    "fieldB": None,
    "fieldC": "99999999999"}]

f = io.StringIO()
fieldnames = ['fieldA', 'fieldB', 'fieldC']
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
k.set_contents_from_string(f.getvalue())

for row in testDict:
    writer.writerow(row)
    k.set_contents_from_string(f.getvalue())

f.close()

This writes 3 lines to the file; however, I'm unable to release memory in order to write a big file. If I add:

f.seek(0)
f.truncate(0)

to the loop, then only the last line of the file is written. Is there any way to release resources without deleting lines from the file?

inquiring minds
  • Even if you could write to S3 like you want, I would not recommend it due to consistency challenges. Why do you think it would be better to not write locally? Would you want a partial S3 object if there was an exception or issue? I presume not. – cgseller Jun 24 '15 at 21:30
  • I was looking to write directly to be a little more efficient. Essentially, if I write the file locally and then upload it, I'm adding the upload as an additional step, plus cleanup of the local file. I don't mind having an incomplete file - I could have an incomplete file if I wrote it locally too. The system will be idempotent and will either delete a file in an error state or continue it. – inquiring minds Jun 25 '15 at 15:24

7 Answers

54

I did find a solution to my question, which I will post here in case anyone else is interested. I decided to do this in parts with a multipart upload; you can't stream to S3 directly, but you can send the file up in chunks as they are produced. There is also a package available, Smart Open, that turns your streaming file into a multipart upload, which is what I used.

import smart_open
import io
import csv

testDict = [{
    "fieldA": "8",
    "fieldB": None,
    "fieldC": "888888888888"},
    {
    "fieldA": "9",
    "fieldB": None,
    "fieldC": "99999999999"}]

fieldnames = ['fieldA', 'fieldB', 'fieldC']
f = io.StringIO()
with smart_open.smart_open('s3://dev-test/bar/foo.csv', 'wb') as fout:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    fout.write(f.getvalue())

    for row in testDict:
        # Reuse the same StringIO as a small scratch buffer: rewind and
        # truncate it before each row so memory use stays flat, then push
        # just that row's CSV text into the multipart upload.
        f.seek(0)
        f.truncate(0)
        writer.writerow(row)
        fout.write(f.getvalue())

f.close()
inquiring minds
  • For Python 2, be sure to use `StringIO.StringIO()` instead of `io.StringIO()`, else you will receive an encoding error – Anconia Aug 15 '16 at 21:17
  • @inquiring minds, that's a good answer. My use case is almost like yours, only the difference is rather than csv, I want to generate an XML. As I like to use templating options like Mako/ genshi for xml-generation, can u suggest me a way how to deal with it? (Generating and writing simultanously, rather than local write first) – Ahsanul Haque Dec 12 '18 at 07:49
  • Binary mode is not needed for StringIO, changing mode from 'wb' to 'w' fixed the issue for me. – Debodirno Chandra Feb 07 '23 at 05:13
1

According to the docs, it's possible:

s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))

so we can use a StringIO object in the ordinary way.
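For example, here is a minimal sketch of that idea, using placeholder bucket/key names similar to the example above and assuming boto3 credentials are already configured. Note that the whole CSV is built in memory and uploaded with a single put() call, so this is buffering rather than true streaming:

import csv
import io

import boto3

s3 = boto3.resource('s3')

# Build the CSV in an in-memory text buffer instead of a local file
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['fieldA', 'fieldB', 'fieldC'])
writer.writeheader()
writer.writerow({'fieldA': '8', 'fieldB': None, 'fieldC': '888888888888'})

# 'mybucket' / 'hello.csv' are placeholders; replace with your own bucket and key
s3.Object('mybucket', 'hello.csv').put(Body=buf.getvalue().encode('utf-8'))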

Update: the smart_open lib from @inquiring minds' answer is a better solution

dirkjot
El Ruso
  • I don't understand how to use this. Isn't /tmp/hello.txt a local file, which is what we're trying to avoid? – EthanP Jul 28 '16 at 20:06
  • @EthanP [StringIO](https://docs.python.org/2/library/stringio.html) — Read and write strings as files. Use `StringIO` object instead of file – El Ruso Jul 29 '16 at 10:45
  • No, according to [this ticket](https://github.com/boto/boto3/issues/256), it is not supported. The idea of using streams with S3 is to avoid using static files when you need to upload huge files of several gigabytes. I am trying to solve this issue as well - I need to read a large amount of data from mongodb and put it to S3, and I don't want to use files. – baldr Jul 29 '16 at 16:09
  • @baldr Hmmm. This trick worked for me in the past. By the way, in the ticket mentioned in your message I see another [useful](https://github.com/boto/boto3/issues/256#issuecomment-139609091) method. Unfortunately I don't work with Amazon now and can't test it – El Ruso Aug 02 '16 at 11:47
  • I tried to dig into the `boto` sources and I see it needs to calculate an MD5 checksum for each file sent. This means that the stream should be at least 'seekable'. I have a non-seekable stream, as I read from mongodb and cannot rewind the data flow easily. The `smart_open` recommended here allows streams, but it just uses an internal buffer and then a 'multipart upload' with `boto` anyway. Technically it is possible to use file-like streams, but be ready for it to require a lot of memory. The idea of a stream is to use low memory to upload a (probably) endless data flow. – baldr Aug 02 '16 at 19:09
  • @baldr However, it looks like S3 in general [can](http://stackoverflow.com/q/8653146/4249707) work with this kind of file. – El Ruso Aug 03 '16 at 11:12
  • @el-ruso, yes, this is exactly how `smart_open` works, and it seems to be the only way to upload these files. You upload a large file in smaller chunks. I wouldn't call it 'stream upload', just 'chunk upload'. – baldr Aug 03 '16 at 18:38
1

We were trying to upload file contents to S3 when they came through as an InMemoryUploadedFile object in a Django request. We ended up doing the following because we didn't want to save the file locally. Hope it helps:

@action(detail=False, methods=['post'])
def upload_document(self, request):
    # `s3` is assumed to be an already-initialized boto3 S3 client
    document = request.data.get('image').file
    s3.upload_fileobj(document, BUCKET_NAME,
                      DESIRED_NAME_OF_FILE_IN_S3,
                      ExtraArgs={"ServerSideEncryption": "aws:kms"})
  • While this approach works, it does not imply streaming - as InMemoryUploadedFile keeps the whole file in RAM. In-memory files are relatively small in size - and they're not generated on-the-fly. – Eugene Jan 20 '22 at 12:34
1

Here is a complete example using boto3:

import boto3
import io

session = boto3.Session(
    aws_access_key_id="...",
    aws_secret_access_key="..."
)

s3 = session.resource("s3")

buff = io.BytesIO()

buff.write("test1\n".encode())
buff.write("test2\n".encode())

# `bucket` and `keypath` are placeholders for your bucket name and object key
s3.Object(bucket, keypath).put(Body=buff.getvalue())
Scott
  • I downvoted because buff.getvalue() is clearly not a stream, but a `bytes` object https://docs.python.org/3/library/io.html#io.BytesIO.getvalue – mdurant Jul 08 '23 at 16:55
1

There's a well-supported library for doing just this:

pip install s3fs

s3fs is really trivial to use:

import s3fs

s3 = s3fs.S3FileSystem(anon=False)

with s3.open('mybucket/new-file', 'wb') as f:
    f.write(2*2**20 * b'a')
    f.write(2*2**20 * b'a')

Incidentally, there's also something built into boto3 (backed by the AWS API) called MultiPartUpload.

This isn't factored as a Python stream, which might be an advantage for some people. Instead, you can start an upload and send parts one at a time, as sketched below.
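For those curious, here is a rough sketch of the low-level boto3 client calls involved, using hypothetical bucket/key names and a toy chunk list. Treat it as an illustration of the flow (create, upload parts, complete) rather than production code:

import boto3

s3 = boto3.client('s3')
bucket, key = 'mybucket', 'big-file.bin'   # placeholder names

# Each part except the last must be at least 5 MB
chunks = [b'a' * (5 * 1024 * 1024), b'the final part can be smaller']

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for part_number, chunk in enumerate(chunks, start=1):
    resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                          PartNumber=part_number, Body=chunk)
    parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
                             MultipartUpload={'Parts': parts})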

Philip Couling
0

There's an interesting code solution mentioned in a GitHub smart_open issue (#82) that I've been meaning to try out. Copy-pasting here for posterity... looks like boto3 is required:

import csv
import gzip
import io

import boto3

# `my_data` is an iterable of rows (e.g. a list of lists); `bucket_name` and
# `key` are your target bucket and object key
csv_data = io.StringIO()          # csv.writer needs a text stream in Python 3
writer = csv.writer(csv_data)
writer.writerows(my_data)

gz_stream = io.BytesIO()
with gzip.GzipFile(fileobj=gz_stream, mode="w") as gz:
    gz.write(csv_data.getvalue().encode("utf-8"))   # gzip wants bytes
gz_stream.seek(0)

s3 = boto3.client('s3')
s3.upload_fileobj(gz_stream, bucket_name, key)

This specific example is streaming to a compressed S3 key/file, but it seems like the general approach -- using the boto3 S3 client's upload_fileobj() method in conjunction with a target stream, not a file -- should work.
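If compression isn't needed, the same idea works with just the in-memory buffer; here is a minimal sketch with placeholder bucket/key names (upload_fileobj reads from any seekable, binary file-like object):

import io

import boto3

s3 = boto3.client('s3')

# Any in-memory, binary file-like object will do; the names below are placeholders
body = io.BytesIO(b"col1,col2\n1,2\n")
s3.upload_fileobj(body, "mybucket", "data/foo.csv")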

Mass Dot Net
  • can you explain what is my_data here ?? is it a list or dict?? – User1011 Feb 08 '22 at 06:28
  • According to this StackOverflow answer, `writer.writerows()` takes an iterable of iterables -- list of lists, array of arrays, etc -- as input: https://stackoverflow.com/a/33092057/165494 – Mass Dot Net Feb 13 '22 at 02:50
-4

To write a string to an S3 object, use:

s3.Object('my_bucket', 'my_file.txt').put(Body='Hello there')

So convert the stream to string and you're there.
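For instance, a minimal sketch of that conversion, reusing the bucket/key names above and encoding to bytes to be safe with the Body parameter:

import io

import boto3

s3 = boto3.resource('s3')

buf = io.StringIO()
buf.write('Hello there')

# getvalue() collapses the in-memory stream into one string, uploaded in a single request
s3.Object('my_bucket', 'my_file.txt').put(Body=buf.getvalue().encode('utf-8'))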

Sam