I am trying to use Amazon Textract through the Python (boto3) interface. Uploading a file from the local drive works fine:

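import boto3
import numpy as np

def filename_to_json(filename):
    """Run Textract text detection on an image file and return the raw response."""
    if filename is None:
        # Nothing to send; avoids returning an unbound variable.
        return None
    client = boto3.client('textract')
    # Read the file in binary mode and pass the raw bytes to Textract.
    with open(filename, 'rb') as image:
        return client.detect_document_text(Document={'Bytes': image.read()})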

My question is how to modify the client.detect_document_text() call so that it works on an image already held in a variable as a numpy ndarray. From the AWS documentation I know that:

Bytes

A blob of base64-encoded document bytes. The maximum size of a document that's provided in a blob of bytes is 5 MB. The document bytes must be in PNG or JPEG format.

If you're using an AWS SDK to call Amazon Textract, you might not need to base64-encode image bytes passed using the Bytes field.

Type: Base64-encoded binary data object

but I cannot figure out how to convert a numpy ndarray to get working code.

I have already tried several conversion methods, such as numpy.ndarray.tobytes() and base64.b64encode(), but without success.

P.S. I am new here, please be understanding.

ITdreamer
  • If you use PIL, you can save a numpy array as an image fairly easily: https://stackoverflow.com/questions/902761/saving-a-numpy-array-as-an-image – Nick ODell Nov 15 '22 at 16:52
  • The problem you're running into with `numpy.ndarray.tobytes()` and similar is that AWS is expecting an image, with an image header and image compression. Simplest way of solving that is to bring in some kind of image processing library - such as PIL. – Nick ODell Nov 15 '22 at 16:53
  • Obviously, saving the image to disk (with PIL) and reading it back from a file works fine, but the question is how to avoid that. This solution basically means that a previous function reads the image file and passes it as an array to a second function, which saves it to disk and loads it again. That seems wrong to me. – ITdreamer Nov 16 '22 at 09:35
  • Technically PIL can save to any file-like object, including a BytesIO object. That lets you avoid a round trip to disk. See also https://stackoverflow.com/questions/646286/how-to-write-png-image-to-string-with-the-pil – Nick ODell Nov 16 '22 at 15:58
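
The in-memory approach described in these comments can be sketched as follows. This is only a minimal sketch, assuming Pillow is installed and the array is a uint8 image of shape (H, W) or (H, W, 3); the helper name `ndarray_to_png_bytes` and the sample array are hypothetical. It also illustrates why `tobytes()` alone does not work: the raw buffer contains pixel values only, with no PNG header.

```python
import io

import numpy as np
from PIL import Image

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def ndarray_to_png_bytes(array: np.ndarray) -> bytes:
    """Encode a uint8 image array as PNG bytes without touching the disk."""
    buffer = io.BytesIO()
    Image.fromarray(array).save(buffer, format="PNG")
    return buffer.getvalue()


# Hypothetical stand-in for an image that is already held in memory.
array = np.zeros((32, 48, 3), dtype=np.uint8)

raw = array.tobytes()              # raw pixel values only, no image header
png = ndarray_to_png_bytes(array)  # a complete PNG file held in memory

print(raw.startswith(PNG_SIGNATURE))  # False
print(png.startswith(PNG_SIGNATURE))  # True
```

The resulting `png` bytes are the same kind of data that `image.read()` returns for a PNG file opened in binary mode.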

1 Answer

You could install amazon-textract-textractor (pip install amazon-textract-textractor), a package that offers easy-to-use methods that take care of the conversions for you.

from PIL import Image
from textractor import Textractor

extractor = Textractor(profile_name="default")
document = extractor.detect_document_text(
    file_source=Image.fromarray(your_array)
)

GitHub: https://github.com/aws-samples/amazon-textract-textractor

Official documentation: https://aws-samples.github.io/amazon-textract-textractor/

Belval
  • Is there a simple way to do the same without the textractor package? I have the feeling that all the tools required are quite basic, and I am only struggling with the AWS client call syntax. – ITdreamer Nov 16 '22 at 09:38
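
One possible way to do this with plain boto3, shown as a minimal sketch rather than a definitive implementation: encode the array to PNG in memory with Pillow and pass the resulting bytes in the `Bytes` field, exactly as the file-based version does with `image.read()`. The function name `array_to_json` and the constant `MAX_BYTES` are hypothetical, the client is passed in as an argument, and the size check assumes that the documented 5 MB limit means 5 * 1024 * 1024 bytes.

```python
import io

import numpy as np
from PIL import Image

# Assumption: the documented 5 MB blob limit is read as 5 * 1024 * 1024 bytes.
MAX_BYTES = 5 * 1024 * 1024


def array_to_json(client, array: np.ndarray) -> dict:
    """Send an in-memory uint8 image array to Textract via the given client."""
    buffer = io.BytesIO()
    Image.fromarray(array).save(buffer, format="PNG")
    image_bytes = buffer.getvalue()
    if len(image_bytes) > MAX_BYTES:
        raise ValueError("encoded image exceeds the size limit for Bytes")
    # Pass the encoded bytes directly, as in the file-based version.
    return client.detect_document_text(Document={"Bytes": image_bytes})


# Usage with a real client (requires AWS credentials and a region):
#   import boto3
#   response = array_to_json(boto3.client("textract"), your_array)
```

Taking the client as a parameter keeps the function easy to try out with a stub object before making a real AWS call.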