
TL;DR: Trying to put .json files into an S3 bucket using Boto3, but the process is very slow. Looking for ways to speed it up.

This is my first question on SO, so I apologize if I leave out any important details. Essentially, I am trying to pull data from Elasticsearch and store it in an S3 bucket using Boto3. I referred to this post to pull multiple pages of ES data using the scroll function of the ES Python client. As I scroll, I process the data and write each document to the bucket as a [timestamp].json file, using this:

    import boto3

    s3 = boto3.resource('s3')
    data = '{"some":"json","test":"data"}'
    key = "path/to/my/file/[timestamp].json"
    # One PUT request per document
    s3.Bucket('my_bucket').put_object(Key=key, Body=data)
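
For reference, the overall shape of the loop is roughly this (a simplified sketch; the index name, query, and the `process()` helper are placeholders for my actual code):

    from elasticsearch import Elasticsearch

    import boto3

    es = Elasticsearch()
    bucket = boto3.resource('s3').Bucket('my_bucket')

    # Open a scroll context, 100 hits per page
    resp = es.search(index='my_index', body={'query': {'match_all': {}}},
                     scroll='2m', size=100)

    while resp['hits']['hits']:
        for hit in resp['hits']['hits']:
            data = process(hit)                       # placeholder: build the JSON string
            key = "path/to/my/file/[timestamp].json"  # one object per document
            bucket.put_object(Key=key, Body=data)     # <-- the slow call
        resp = es.scroll(scroll_id=resp['_scroll_id'], scroll='2m')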

While running this on my machine, I noticed that the process is very slow. Using a line profiler, I found that this single line consumes over 96% of my program's total runtime:

    s3.Bucket('my_bucket').put_object(Key=key, Body=data)

What modification(s) can I make to speed this up? Keep in mind that I am creating the .json payloads in my program (each one is ~240 bytes) and streaming them directly to S3 rather than saving them locally and uploading the files. Thanks in advance.

fowtom
  • How much time on average is the `put_object` call taking? You said the profiling was performed on the 4 lines of code you included in the question. If that's the case, then I would think that 96% of the entire program time makes sense as the S3 API call is the only external service call. If you get 96% and you also have the code for pulling ES data during the profile run, then more detail might be needed. – dmulter Jun 20 '18 at 22:24
  • @dmulter In the context of my program, the s3 instance is initialized before entering a loop where the unique key and data are created, and then the `put_object()` is called. I tested it out by scrolling three 100-length pages, so for 300 `put_object()` calls it took about 60 seconds (entire program execution was ~63 seconds). The whole program consists of importing libraries and one function, and I profiled that function. – fowtom Jun 20 '18 at 22:38

1 Answer


Since you are potentially uploading many small files, you should consider a few items:

  • Some form of threading/multiprocessing. For example, see How to upload small files to Amazon S3 efficiently in Python (and the first sketch after this list).
  • Creating some form of archive file (ZIP) containing sets of your small data blocks and uploading them as larger files (second sketch below). This is of course dependent on your access patterns. If you go this route, be sure to use the boto3 upload_file or upload_fileobj methods instead, as they will handle multi-part upload and threading.
  • S3 performance implications as described in Request Rate and Performance Considerations
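
For the threading approach, a minimal sketch (not a drop-in solution: the bucket name, key naming, and the `docs` iterable are placeholders, and error handling is kept to a bare minimum) could look like this. Note that boto3 clients, unlike resources, are thread-safe, so a single client can be shared across workers:

    from concurrent.futures import ThreadPoolExecutor

    import boto3

    s3 = boto3.client('s3')   # clients (unlike resources) are safe to share across threads
    BUCKET = 'my_bucket'      # placeholder bucket name

    def upload_docs(docs, max_workers=20):
        """docs: iterable of (key, json_string) pairs produced by the scroll loop."""
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(s3.put_object, Bucket=BUCKET, Key=k, Body=d)
                       for k, d in docs]
            for f in futures:
                f.result()    # re-raise any upload errors

And if your access pattern allows batching, a rough sketch of the archive idea (again, names are placeholders):

    import io
    import zipfile

    def upload_zip(docs, key='path/to/my/file/batch.zip'):
        """Bundle many small JSON documents into one ZIP and upload it in one call."""
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
            for name, data in docs:
                zf.writestr(name, data)
        buf.seek(0)
        s3.upload_fileobj(buf, BUCKET, key)   # upload_fileobj handles multipart and threading for you

Either way, the main win is cutting down serial HTTP round trips: 300 sequential `put_object` calls taking ~60 seconds (per the comments) is roughly 200 ms per request, which suggests the time is dominated by per-request overhead rather than data transfer, since each object is only ~240 bytes.
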
dmulter