
I have this simple Python Lambda that downloads a JPG image and uploads it to an S3 bucket:

import contextlib
from io import BytesIO

import boto3
import requests

url = 'https://somesite.com/11/frame.jpg?abs_begin=2019-08-29T05:18:26Z'

s3 = boto3.client('s3')

with contextlib.closing(requests.get(url, stream=True, verify=False)) as response:
    fp = BytesIO(response.content)
    s3.upload_fileobj(fp, bucket_name, 'my-dir/' + 'test_img.jpg')

However, when looking in my bucket it says the file size is 162 bytes. When downloading it from the browser GUI to my local disk, macOS prompts: The file "test_img.jpg" could not be opened. It may be damaged or use a file format that Preview doesn’t recognise.

Any idea what causes this?

NorwegianClassic

1 Answer


Are you sure that site is giving you a JPEG file? I'd suggest checking response.status_code somehow; I normally just put a raise_for_status() in there and let the calling code handle the exception.
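
For example, a quick sanity check along those lines (this reuses the URL and verify=False from the question; note verify=False disables certificate checks):

import requests

url = 'https://somesite.com/11/frame.jpg?abs_begin=2019-08-29T05:18:26Z'

response = requests.get(url, verify=False)
response.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx status

# a 162-byte "image" is most likely a small HTML or JSON error body, not a JPEG
print(response.status_code, response.headers.get('Content-Type'), len(response.content))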

Also, you only need to pass stream=True if you're actually streaming the content; as written you read everything in one go via response.content, so requesting a stream is a waste. Streaming is recommended, though, because otherwise you have to hold the whole file in memory, which can itself be a waste.
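
To make the difference concrete (the local file name here is just a placeholder):

import requests

url = 'https://somesite.com/11/frame.jpg?abs_begin=2019-08-29T05:18:26Z'

# with .content the whole body ends up in memory anyway, so stream=True adds nothing
body = requests.get(url).content

# actual streaming: consume the body in chunks instead of holding it all at once
with requests.get(url, stream=True) as response:
    with open('frame.jpg', 'wb') as out:  # placeholder local file name
        for chunk in response.iter_content(8192):
            out.write(chunk)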

If you want to check that you're actually getting an image, you could use Pillow to open it before uploading to S3, something like:

import tempfile

import boto3
import requests
from PIL import Image  # pip install -U Pillow

s3 = boto3.client('s3')
bucket_name = 'my-bucket'  # replace with your bucket name

# dummy image
url = 'https://picsum.photos/id/1053/1500/1000'

# use a temp file in case we get a large image
with tempfile.TemporaryFile() as fd:
    with requests.get(url, stream=True) as response:
        # make sure the HTTP request succeeded
        response.raise_for_status()

        # stream the body to disk in 8 KiB chunks
        for data in response.iter_content(8192):
            fd.write(data)

    # seek back to the beginning of the file and load it to make sure it's OK
    fd.seek(0)
    with Image.open(fd) as img:
        # will raise an exception on failure
        img.verify()
        print(f'got a {img.format} image of size {img.size}')

    # rewind again so the upload starts from the beginning of the file
    fd.seek(0)

    # let S3 do its thing
    s3.upload_fileobj(fd, bucket_name, 'my-dir/test_img.jpg')
Sam Mason
  • Thank you for the detailed answer! Your script almost worked; I had to add a `fd.seek(0)` right before `s3.upload_fileobj( ... )` to get it to work properly. The frames are probably less than 20 KB, so do you mean this script would be overkill? My end goal is to upload **lots** of frames to the S3 bucket. – NorwegianClassic Sep 02 '19 at 07:42
  • Could you define "lots"? Maybe you could group them up and put them into some sort of archive, or put them into an [mpeg file](https://stackoverflow.com/a/44948030/1358308) first? – Sam Mason Sep 02 '19 at 08:36
  • What, isn't "lots" well defined? ;) Somewhere between a few hundred and tens of thousands. Maybe, but they have to be stored as JPGs in S3. This is going a bit off topic, I'm sorry about that. – NorwegianClassic Sep 02 '19 at 08:43
  • Order of magnitude was fine; algorithm choices only really change every two or three orders. Given that you need them to be JPEG objects in S3, I think just doing the above in a loop is about all you can do. Depending on where they are coming from, you might think about doing it in parallel; the standard `multiprocessing` module might help (aim to pass a list of URLs to `Pool.map`, roughly as sketched below) or you could use `joblib`. – Sam Mason Sep 02 '19 at 08:50
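
A rough sketch of that parallel approach, assuming a hypothetical fetch_and_upload helper, a placeholder bucket name and a placeholder list of (URL, S3 key) pairs:

import multiprocessing
from io import BytesIO

import boto3
import requests

bucket_name = 'my-bucket'  # placeholder: your bucket name

def fetch_and_upload(url_and_key):
    # hypothetical helper: download one frame and push it straight to S3
    url, key = url_and_key
    s3 = boto3.client('s3')  # create the client inside the worker process
    response = requests.get(url)
    response.raise_for_status()
    s3.upload_fileobj(BytesIO(response.content), bucket_name, key)
    return key

if __name__ == '__main__':
    # placeholder list of (url, S3 key) pairs, one per frame
    jobs = [
        ('https://somesite.com/11/frame.jpg?abs_begin=2019-08-29T05:18:26Z', 'my-dir/frame_0001.jpg'),
    ]
    with multiprocessing.Pool(8) as pool:
        for key in pool.map(fetch_and_upload, jobs):
            print('uploaded', key)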