
I have a bucket on S3. I want to connect to it, read the pictures/PDFs into memory on my EC2 machine, perform OCR, and extract the needed fields.

Here is what I have done so far, but unfortunately it doesn't work.

import cv2
import boto3
import matplotlib
import numpy as np
import pytesseract
from PIL import Image


boto3.setup_default_session(profile_name='default-mfasession')
s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')
bucket_name = "my_bucket"
key = "my-files/._Screenshot 2020-04-20 at 14.21.20.png"

bucket = s3_resource.Bucket(bucket_name)
object = bucket.Object(key)
response = object.get()
file_stream = response['Body']
im = Image.open(file_stream)
np.array(im)

This returns an error:

UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7fae33dce110>

I have tried all the answers related to this issue on SO, but nothing helped, including "matplotlib: ValueError: invalid PNG header" and "PIL cannot identify image file for io.BytesIO object".

Please advise how to solve it.

SteveS
    Are you positively, absolutely, definitely sure it **is** a PNG file? I.e., you are not blindly believing the file extension or what other tools say but you opened it with a hex viewer and saw the magic byte header (and other easily recognizable parts)? – Jongware Apr 26 '20 at 09:00
    @usr2564301 I know what I have in my bucket, but I do have this point in mind: I will probably get PDF, GIF, JPEG ... files containing images, and I need to parse them. – SteveS Apr 26 '20 at 09:11

1 Answer

This is what I usually use. Maybe it will work for you as well:

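import io
import boto3
from PIL import Image

s3_resource = boto3.resource('s3')

def image_from_s3(bucket, key):
    # Read the whole object into memory so Pillow gets a seekable buffer.
    bucket = s3_resource.Bucket(bucket)
    image = bucket.Object(key)
    img_data = image.get().get('Body').read()
    return Image.open(io.BytesIO(img_data))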

And in your handler you execute this:

    img = image_from_s3(image_bucket, image_key)

img should be a Pillow Image object if the call succeeds.
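If Pillow still cannot identify the file, it may help to check the first bytes of the downloaded data, as the comment under the question suggests. The following is only a minimal sketch: sniff_format and the SIGNATURES table are hypothetical names, and the list covers just the formats mentioned in the question plus AppleDouble. That last entry is a guess: keys starting with "._" are typically metadata sidecar files written by macOS rather than real images, but nothing in this thread confirms that is what the object in the question contains.

```python
# Hypothetical helper (not part of boto3 or Pillow): guess a file type
# from its leading "magic" bytes instead of trusting the key's extension.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"%PDF-", "pdf"),
    # Assumption: macOS "._name" sidecar files use the AppleDouble header.
    (b"\x00\x05\x16\x07", "appledouble"),
]


def sniff_format(data: bytes) -> str:
    """Return a short format name for the given bytes, or "unknown"."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"
```

Calling sniff_format(img_data) before Image.open would show whether the bytes are something Pillow can be expected to open. A result of "pdf" would need a PDF library rather than Pillow, and "appledouble" or "unknown" would suggest the key points at the wrong object.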

Marcin