1

I am generating a large file in Python from an asynchronous queue that transforms many units of data and appends them (unordered) to a large file.

The final destination of this file is S3. To save I/O and dead time (waiting for the file to be complete before uploading it), I would like to avoid writing the file to local disk first and just stream the data to S3 as it is generated.

The units are all of different sizes, but I can specify a reasonable maximum chunk size that is larger than any unit.

Most of the examples I see on the Web (e.g. https://medium.com/analytics-vidhya/aws-s3-multipart-upload-download-using-boto3-python-sdk-2dedb0945f11) describe how to do a multi-part upload with boto3 from a file, not from data generated at runtime.

Is this possible, and is it a recommended approach?

EDIT: I removed the "multi-part" term from the title because I realized it could be misleading. What I really need is serial streaming of data chunks.

Thanks.

user3758232
  • 758
  • 5
  • 19
  • It's awkward that I tagged the post "s3" and it became "amazon-s3" complete with branding. Actually at the moment I am testing on a local S3 server that has nothing to do with AWS... – user3758232 Sep 29 '21 at 17:12

2 Answers

1

The upload() method of the MultipartUploadPart object accepts a Body parameter that can be either a file-like object or a bytes object, which is what you want.

Take a look at the documentation.
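
As an illustration, here is a minimal sketch of that approach using the boto3 resource API. The bucket name, object key, and chunk generator are made up for the example; note that every part except the last must be at least 5 MB and there can be at most 10,000 parts per upload:

```python
import boto3

BUCKET = "my-bucket"          # hypothetical bucket name
KEY = "large-output.bin"      # hypothetical object key

s3 = boto3.resource("s3")
mpu = s3.Object(BUCKET, KEY).initiate_multipart_upload()

def generate_chunks():
    # Placeholder: yield byte buffers as they are produced at runtime.
    # Every part except the last must be at least 5 MB in size.
    yield b"\x00" * (5 * 1024 * 1024)
    yield b"\x01" * 1024

parts = []
for part_number, chunk in enumerate(generate_chunks(), start=1):
    response = mpu.Part(part_number).upload(Body=chunk)  # Body takes raw bytes
    parts.append({"PartNumber": part_number, "ETag": response["ETag"]})

mpu.complete(MultipartUpload={"Parts": parts})
```

Each chunk still needs an explicit part number, which is the limitation discussed in the comments below.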

SebDieBln
  • 3,303
  • 1
  • 7
  • 21
  • Thanks for the suggestion, unfortunately that won't work for me because it requires numbering the parts (1 to 10,000 max) and each part must be at least 5 MB. I have millions of units, 2-4 KB each; assembling them to fit the API requirements would be complicated. – user3758232 Sep 30 '21 at 15:26
  • @user3758232 OK, it seems to be a limitation of the multipart upload process in general, not something specific to uploading a buffer instead of a file. – SebDieBln Oct 01 '21 at 16:10
0

(Answering the updated question of streaming vs uploading a file)

I don't think it is possible to start an upload for a completely unknown amount of data (streaming).

All the upload functions take either bytes or a seekable file-like object (see the documentation for S3.Object.put()). This is most likely because they need to know the size of the data before actually transmitting it.
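
For instance (bucket name and key are placeholders), put() happily accepts an in-memory payload, but only one that exists in full before the call:

```python
import io
import boto3

s3 = boto3.resource("s3")
data = b"generated at runtime"  # the whole payload must exist before the call

# Both of these work; neither allows appending more data afterwards.
s3.Object("my-bucket", "output.bin").put(Body=data)
s3.Object("my-bucket", "output.bin").put(Body=io.BytesIO(data))
```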

You could, however, consider saving each result as a separate object in S3 and assembling them into one large file only when downloading. But that would require a special program to download the data, and it might also increase costs due to the higher number of requests and objects.
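
A rough sketch of that idea (the bucket name and key prefix are made up): each unit is written with put_object under a common prefix, and a separate step lists the keys and concatenates them on download:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"    # hypothetical bucket name
PREFIX = "results/"     # hypothetical key prefix grouping the units

def upload_unit(index: int, data: bytes) -> None:
    # One object per generated unit; zero-padded keys keep the listing order stable.
    s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}part-{index:09d}", Body=data)

def assemble(local_path: str) -> None:
    # Download and concatenate all units into a single local file.
    paginator = s3.get_paginator("list_objects_v2")
    with open(local_path, "wb") as out:
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            for obj in page.get("Contents", []):
                out.write(s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read())
```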

SebDieBln
  • 3,303
  • 1
  • 7
  • 21
  • That's not doable in my case, where the effort to assemble a single file from millions of units would be repeated for every download. – user3758232 Oct 02 '21 at 01:54
  • @user3758232 Another solution would be to upload single objects and let an AWS Lambda function assemble the large file. – SebDieBln Oct 03 '21 at 22:50
  • @user3758232 Or you could take a look at Amazon EFS, where you have actual files that you can append to. – SebDieBln Oct 03 '21 at 22:53