6

I'm using boto3 (the AWS SDK for Python) to analyze a PDF document and extract its form key-value pairs.

import boto3

def process_text_analysis(bucket, document):
    # Get the document from S3
    s3_connection = boto3.resource('s3')
    s3_object = s3_connection.Object(bucket, document)
    s3_response = s3_object.get()
    # Analyze the document
    client = boto3.client('textract')
    response = client.analyze_document(Document={'S3Object': {'Bucket': bucket, 'Name': document}},
                                       FeatureTypes=["FORMS"])


process_text_analysis('francismorgan-01', '709 Privado M SURESTE.pdf')

I followed the AWS documentation for AnalyzeDocument, but when I run my function I get this error:

botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

Am I missing something?

John Rotenstein
gmwill934

2 Answers

8

AnalyzeDocument is a synchronous API that only supports PNG or JPG images.

Since you want to work with PDF files, you'll need to use one of the Amazon Textract asynchronous APIs, e.g. StartDocumentAnalysis or StartDocumentTextDetection.

aksyuma
3

As the docs say:

StartDocumentAnalysis can analyze text in documents that are in JPEG, PNG, TIFF, and PDF format. The documents are stored in an Amazon S3 bucket. Use DocumentLocation to specify the bucket name and file name of the document.

Boto3 Example

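import time

import boto3

client = boto3.client('textract')

# Start the asynchronous job; this call returns a JobId right away, not the result
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'YOUR_BUCKET_NAME',
            'Name': 'YOUR_FILE_KEY_NAME'
        }
    },
    FeatureTypes=["FORMS"]
)

# Get results from the asynchronous operation.
# The job may still be running, so poll until JobStatus leaves IN_PROGRESS.
result = client.get_document_analysis(JobId=response['JobId'])
while result['JobStatus'] == 'IN_PROGRESS':
    time.sleep(5)
    result = client.get_document_analysis(JobId=response['JobId'])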

Additionally, the AWS docs provide a TextractWrapper class with start_analysis_job and get_analysis_job methods that do the same as the previous example.
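Since the question asks for the form key:value pairs, here is a minimal sketch of how the result of get_document_analysis could be turned into a plain dict. It assumes the job has already finished and that the response follows the usual Textract block layout (KEY_VALUE_SET blocks linked to WORD blocks through CHILD and VALUE relationships). The helper names collect_blocks and extract_key_values are hypothetical and not part of boto3; large results come back in pages, so the first helper follows NextToken.

```python
def collect_blocks(client, job_id):
    """Gather the Blocks from every page of a finished analysis job."""
    blocks, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = client.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks


def _text_of(block, by_id):
    """Join the text of a block's CHILD words (or selection statuses)."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id.get(child_id)
            if child is None:
                continue
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def extract_key_values(blocks):
    """Map each form key's text to the text of its linked value block."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    _text_of(by_id[i], by_id) for i in rel["Ids"] if i in by_id
                )
        pairs[_text_of(block, by_id)] = value_text
    return pairs
```

With a real client this would be called as extract_key_values(collect_blocks(client, response['JobId'])) once the job status is SUCCEEDED. Keys that appear more than once in the form overwrite each other in this sketch, so a list of pairs may be a better fit for such documents.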

Miguel Trejo