How to analyse PDF documents with Amazon Textract in a Synchronous way?

Question

I want to extract tables from a bunch of PDFs I have. To do this I am using an AWS Textract pipeline written in Python.

How can I do this without SNS and SQS? I want it to be synchronous: give my pipeline a PDF file, call AWS Textract, and get the results.

Here is what I use at the moment. What should I change?

import boto3
import time

def startJob(s3BucketName, objectName):
    response = None
    client = boto3.client('textract')
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': objectName
            }
        })

    return response["JobId"]

def isJobComplete(jobId):
    # For production use cases, use SNS based notification 
    # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))

    while(status == "IN_PROGRESS"):
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))

    return status

def getJobResults(jobId):

    pages = []

    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)

    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = None
    if('NextToken' in response):
        nextToken = response['NextToken']

    while(nextToken):

        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)

        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = None
        if('NextToken' in response):
            nextToken = response['NextToken']

    return pages

# Document
s3BucketName = "ki-textract-demo-docs"
documentName = "Amazon-Textract-Pdf.pdf"

jobId = startJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))

# isJobComplete returns the final status string, so compare it explicitly;
# any non-empty string (including "FAILED") would otherwise count as true.
if isJobComplete(jobId) == "SUCCEEDED":
    response = getJobResults(jobId)

    #print(response)

    # Print detected text
    for resultPage in response:
        for item in resultPage["Blocks"]:
            if item["BlockType"] == "LINE":
                print('\033[94m' + item["Text"] + '\033[0m')

score 2 · Accepted Answer · answered Jun 03 '20 at 13:43

You cannot directly process PDF documents synchronously with Textract currently. From the Textract documentation:

Amazon Textract synchronous operations (DetectDocumentText and AnalyzeDocument) support the PNG and JPEG image formats. Asynchronous operations (StartDocumentTextDetection, StartDocumentAnalysis) also support the PDF file format.

A work-around would be to convert the PDF document into images in your code and then use the synchronous API operations with these images to process the documents.
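A minimal sketch of that work-around, not a definitive implementation. It assumes the third-party pdf2image package (which needs poppler installed) and boto3 are available. The helper names (pdf_to_images, image_to_png_bytes, analyze_images, analyze_pdf) and the 300 dpi default are my own assumptions, and FeatureTypes=["TABLES"] is chosen only because the question asks for tables:

```python
import io


def pdf_to_images(pdf_path, dpi=300):
    """Render each PDF page to a PIL image (assumes pdf2image + poppler)."""
    # Imported lazily so the other helpers work without pdf2image installed.
    from pdf2image import convert_from_path
    # The dpi default is an assumption; low-resolution renders may hurt extraction.
    return convert_from_path(pdf_path, dpi=dpi)


def image_to_png_bytes(image):
    """Serialize a PIL image to PNG bytes, a format the synchronous API accepts."""
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()


def analyze_images(images, client):
    """Call the synchronous analyze_document operation once per page image."""
    responses = []
    for image in images:
        response = client.analyze_document(
            Document={"Bytes": image_to_png_bytes(image)},
            FeatureTypes=["TABLES"],
        )
        responses.append(response)
    return responses


def analyze_pdf(pdf_path, dpi=300):
    """Convert a local PDF to page images and analyze them one by one."""
    import boto3  # imported lazily; credentials and region come from your AWS config
    client = boto3.client("textract")
    return analyze_images(pdf_to_images(pdf_path, dpi=dpi), client)
```

Calling analyze_pdf("some-file.pdf") would return one response dict per page, in page order. Each call is blocking, so no SNS topic, SQS queue, or polling loop is involved.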

Just be careful if you follow the pdf2image route. I have had errors in the extraction process due to low dpi jpg files. — Daniel, Oct 02 '20 at 21:50

Soumya · Answer 2 · 2023-05-01T04:38:22.673

Thanks for the answers; they helped me look into this further. I found that the detect_document_text method in Textract can extract text from a PDF document, provided the PDF has only one page. This is a synchronous call, so there is no need to convert the PDF to an image at all.

Here is the AWS reference: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract/client/detect_document_text.html

Below is the code snippet, where I am passing the binary content from an S3 object:

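import logging

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)

# 'my-bucket' is a placeholder; the original snippet did not show how `bucket` was created
bucket = boto3.resource('s3').Bucket('my-bucket')
obj = bucket.Object('Test.pdf')
binary_file = obj.get()['Body'].read()

textract = boto3.client(service_name="textract", region_name="us-east-1")

def get_textract_response(file_content):
    # Returns the Textract response, or the string "Uncertain" if the call fails
    try:
        response = textract.detect_document_text(Document={'Bytes': file_content})
        logger.info(f"Detected {len(response['Blocks'])} blocks.")
    except ClientError:
        logger.exception("Couldn't detect text.")
        response = "Uncertain"
    except Exception:
        logger.info("textract could not detect text")
        response = "Uncertain"
    # Return in every branch; previously the "Uncertain" value was never returned
    return response

response = get_textract_response(binary_file)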
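Since the original question is about tables: detect_document_text returns only pages, lines and words, so table structure needs analyze_document with FeatureTypes=["TABLES"] instead. The following is a rough sketch of turning such a response into rows and columns. The function name extract_tables is my own, and the sketch ignores merged cells and selection elements:

```python
def extract_tables(response):
    """Return each TABLE block as a list of rows, each row a list of cell strings."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}

    def child_ids(block):
        # Collect the Ids listed under CHILD relationships, if there are any.
        ids = []
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "CHILD":
                ids.extend(relationship["Ids"])
        return ids

    def cell_text(cell):
        # A cell's text is the space-joined text of its WORD children.
        children = [blocks_by_id[i] for i in child_ids(cell)]
        return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")

    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = [blocks_by_id[i] for i in child_ids(block)]
        cells = [c for c in cells if c["BlockType"] == "CELL"]
        n_rows = max((c["RowIndex"] for c in cells), default=0)
        n_cols = max((c["ColumnIndex"] for c in cells), default=0)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for cell in cells:
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = cell_text(cell)
        tables.append(grid)
    return tables
```

With a real response you would call something like extract_tables(textract.analyze_document(Document={'Bytes': file_content}, FeatureTypes=['TABLES'])) and get back one grid per detected table.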
