Amazon Textract enables document text detection and analysis in applications. The Amazon Textract Text Detection API can detect text in a variety of documents including financial reports, medical records, and tax forms. For documents with structured data, you can use the Amazon Textract Document Analysis API to detect linked text, tables, option buttons (radio buttons), and check boxes.
Questions tagged [amazon-textract]
226 questions
20
votes
2 answers
Unsupported Document format while using Amazon Textract
I am using Amazon Textract with boto3. When I try to parse a PDF file accessed via Amazon S3, it gives me an error: Request has unsupported document format…

Jung Thapa
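A minimal sketch for the unsupported-format question above, assuming the error comes from sending a PDF to a synchronous operation. When questions like this were first asked, the synchronous Textract operations accepted only JPEG and PNG images, and PDFs had to go through the asynchronous Start* operations. As far as I know, single-page PDF and TIFF support was later added to the synchronous calls, so check the current documentation. The helper names `pick_text_detection_call` and `detect_text` are hypothetical; `detect_document_text` and `start_document_text_detection` are boto3 Textract client methods.

```python
import os

# Assumption: route PDF/TIFF to the asynchronous, S3-based operation,
# which is the conservative choice for possibly multi-page documents.
ASYNC_EXTENSIONS = {".pdf", ".tif", ".tiff"}
# Plain images go to the synchronous operation.
SYNC_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def pick_text_detection_call(s3_key: str) -> str:
    """Return the name of the boto3 Textract method to use for this object."""
    extension = os.path.splitext(s3_key)[1].lower()
    if extension in ASYNC_EXTENSIONS:
        return "start_document_text_detection"
    if extension in SYNC_EXTENSIONS:
        return "detect_document_text"
    raise ValueError(f"Unrecognized document extension: {extension!r}")


def detect_text(bucket: str, s3_key: str, client=None):
    """Call the operation chosen above for an object stored in S3."""
    if client is None:
        import boto3  # imported lazily so the helper above runs without AWS access

        client = boto3.client("textract")
    location = {"S3Object": {"Bucket": bucket, "Name": s3_key}}
    if pick_text_detection_call(s3_key) == "detect_document_text":
        # Synchronous: the response already contains the Blocks.
        return client.detect_document_text(Document=location)
    # Asynchronous: the response contains a JobId to poll later.
    return client.start_document_text_detection(DocumentLocation=location)
```

The routing is deliberately conservative: sending every PDF through the asynchronous operation avoids having to know the page count up front.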
12
votes
2 answers
Amazon Textract vs Amazon Rekognition DetectText
How do I decide when to use Amazon Textract vs Amazon Rekognition's DetectText method?
My use case is to take a picture on a mobile device, convert the image data into text, and store it in AWS…

vaquar khan
10
votes
5 answers
AWS Textract StartDocumentAnalysis function not publishing a message to the SNS Topic
I am working with AWS Textract and I want to analyze a multipage document, so I have to use the async operations. I first called the startDocumentAnalysis function and got a JobId in return, but it needs to trigger a function that I have set…

gokublack
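For the SNS question above, a sketch of how the completion notification is wired into the request, assuming the Python SDK (boto3) rather than whichever SDK the asker used. `build_analysis_request` and `start_analysis` are hypothetical helper names; `DocumentLocation`, `FeatureTypes`, and `NotificationChannel` (with `SNSTopicArn` and `RoleArn`) are request fields of StartDocumentAnalysis. All ARNs, bucket names, and keys are placeholders.

```python
def build_analysis_request(bucket, key, sns_topic_arn, role_arn,
                           feature_types=("TABLES", "FORMS")):
    """Build keyword arguments for the boto3 start_document_analysis call."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": list(feature_types),
        # Textract assumes RoleArn in order to publish the completion
        # message to SNSTopicArn when the job finishes.
        "NotificationChannel": {
            "SNSTopicArn": sns_topic_arn,
            "RoleArn": role_arn,
        },
    }


def start_analysis(client, request):
    """Start the asynchronous job and return its JobId."""
    return client.start_document_analysis(**request)["JobId"]


# Placeholder values for illustration only.
example_request = build_analysis_request(
    bucket="my-bucket",
    key="contracts/multipage.pdf",
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:AmazonTextractDemo",
    role_arn="arn:aws:iam::123456789012:role/TextractPublishRole",
)
```

If no message arrives, it is worth checking that Textract can assume the role in `RoleArn` and that the role may publish to that topic. My understanding is that the AWS managed policy AmazonTextractServiceRole only permits publishing to topics whose names begin with AmazonTextract; treat that as something to verify rather than a confirmed diagnosis of this question.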
7
votes
6 answers
How to use the Amazon Textract with PDF files
I can already use Textract with JPEG files, but I would like to use it with PDF files.
I have the code below:
import boto3
# Document
documentName = "Path to document in JPEG"
# Read document content
with open(documentName, 'rb') as…

ArthurS
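For the PDF question above, a sketch of the asynchronous route for a PDF stored in S3: start a job, poll until it leaves IN_PROGRESS, then follow NextToken to collect every page of results. `wait_for_text_detection` and its parameters are hypothetical; the client is passed in so the function can be exercised without AWS access. Polling is used here for brevity, with an SNS notification channel being the alternative.

```python
import time


def wait_for_text_detection(client, bucket, key, poll_seconds=5.0, max_polls=120):
    """Run StartDocumentTextDetection on an S3 object and return all Blocks."""
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        status = result["JobStatus"]
        if status == "IN_PROGRESS":
            time.sleep(poll_seconds)
            continue
        if status != "SUCCEEDED":
            # FAILED and PARTIAL_SUCCESS are both treated as errors in this sketch.
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")

        # Results are paginated; keep requesting until no NextToken is returned.
        blocks = list(result.get("Blocks", []))
        while result.get("NextToken"):
            result = client.get_document_text_detection(
                JobId=job_id, NextToken=result["NextToken"]
            )
            blocks.extend(result.get("Blocks", []))
        return blocks

    raise TimeoutError(f"Textract job {job_id} did not finish after {max_polls} polls")
```

With real credentials this would be called as `wait_for_text_detection(boto3.client("textract"), "my-bucket", "scan.pdf")`, where the bucket and key are placeholders.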
6
votes
1 answer
InvalidS3ObjectException: Unable to get object metadata from S3?
So I am trying to use Amazon Textract to read in multiple PDF files, each with multiple pages, using the StartDocumentTextDetection method as follows:
client = boto3.client('textract')
textract_bucket = s3.Bucket('my_textract_console-us-east-2')
for…

ocean800
6
votes
2 answers
AWS Textract - UnsupportedDocumentException - PDF
I'm using boto3 (aws sdk for python) to analyze a document (a pdf) to get the form key:value pairs.
import boto3
def process_text_analysis(bucket, document):
    # Get the document from S3
    s3_connection = boto3.resource('s3')
    s3_object =…

gmwill934
6
votes
5 answers
Amazon Textract: I can't find the trp module
I want to use this Amazon Textract table script.
The problem is that I have no idea what the trp module is or how to install it.
I tried
pip install trp
But when I try to run it, I get this…

Iakovos Belonias
5
votes
2 answers
Using Textract, how do you extract tables from a pdf file and output it into a csv file via .py script?
I want to use Textract (via the AWS CLI) to extract tables from a PDF file (located in S3) and export them to a CSV file. I have tried writing a .py script but am struggling to read from the file.
Any suggestions for writing the .py script…

Chris You
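For the table-to-CSV question above, a sketch of the conversion step only, assuming you already have the `Blocks` list from a document analysis response that was requested with the TABLES feature (for example, JSON saved from the CLI and loaded with `json.load`). `tables_to_rows` and `rows_to_csv` are hypothetical names. The block fields used here (`BlockType`, `Id`, `Relationships`, `RowIndex`, `ColumnIndex`, `Text`) follow the Textract response format, in which a TABLE block lists its CELL blocks as children and each CELL lists its WORD blocks.

```python
import csv
import io


def _child_ids(block):
    """Collect the Ids listed under a block's CHILD relationships."""
    ids = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            ids.extend(relationship["Ids"])
    return ids


def tables_to_rows(blocks):
    """Return one list of rows per TABLE block; each row is a list of cell strings."""
    by_id = {block["Id"]: block for block in blocks}
    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cell_id in _child_ids(table):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[word_id]["Text"]
                for word_id in _child_ids(cell)
                if by_id[word_id]["BlockType"] == "WORD"
            ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        # RowIndex and ColumnIndex are 1-based; empty cells become "".
        row_count = max((row for row, _ in cells), default=0)
        column_count = max((column for _, column in cells), default=0)
        tables.append([
            [cells.get((row, column), "") for column in range(1, column_count + 1)]
            for row in range(1, row_count + 1)
        ])
    return tables


def rows_to_csv(rows):
    """Serialize one table's rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Each table can then be written to its own file, for example `open("table_1.csv", "w", newline="").write(rows_to_csv(tables[0]))`, where the file name is a placeholder. Merged cells and selection elements (check boxes) are ignored in this sketch.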
5
votes
1 answer
Using Textract for OCR locally
I want to extract text from images using Python. (The Tesseract library does not work for me because it requires installation.)
I have found the boto3 library and Textract, but I'm having trouble working with them. I'm still new to this. Can you tell me what I need…

taga
5
votes
2 answers
How to retrieve tables that exist in a PDF using AWS Textract in Java
I found the article below, which shows how to do it in Python.
https://docs.aws.amazon.com/textract/latest/dg/examples-export-table-csv.html
I also used the article below to extract text.
https://docs.aws.amazon.com/textract/latest/dg/detecting-document-text.html
but above…

Farhan
5
votes
4 answers
AWS Textract InvalidParameterException
I have a .NET Core client application using Amazon Textract with S3, SNS, and SQS, as per the AWS documentation, Detecting and Analyzing Text in Multipage Documents (https://docs.aws.amazon.com/textract/latest/dg/async.html).
Created an AWS Role with…

Nabeel
4
votes
0 answers
URL.hostname is not implemented
I'm looking for some help on my Textract client project. I am trying to follow the AWS Textract documentation, but I am stuck at textractClient.send(). I am getting the error: URL.hostname is not implemented.
I have followed the steps on AWS to…

Jackc01999
4
votes
1 answer
How to get the font style from an image with text?
I am using the Amazon Textract API, through AWS' Python API, to extract text from a document (pdf or jpg). I do get the text and coordinates of its bounding box, but I would also love to have the font type (only the major ones needed: Arial,…

tyrex
4
votes
0 answers
AccessDeniedException when calling AnalyzeDocument
When calling AnalyzeDocument I receive an Amazon.Textract.Model.AccessDeniedException:
Additional information: User: arn:aws:iam::[number]:user/service is not
authorized to perform: textract:AnalyzeDocument
The user is in a group with the…

Wolfgang Radl
3
votes
3 answers
I am using AWS Textract StartDocumentTextDetectionCommand and GetDocumentTextDetectionCommand. I want only lines to be returned, not the single words
I am creating an internal OCR tool using AWS Textract and Node.js to detect text in a scanned PDF, specifically with StartDocumentTextDetectionCommand and GetDocumentTextDetectionCommand. Currently the result is returned as a list of Block objects with the lines…

Faris Ashhab
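For the lines-only question above, a sketch of filtering on the client side. The question uses the JavaScript SDK; this is written in Python to match the other excerpts on this page, and the same filter carries over because the response shape is the same. As far as I know, text detection returns PAGE, LINE, and WORD blocks together with no request option to restrict them, so the filtering happens after the call. `extract_lines` and `lines_by_page` are hypothetical names; `BlockType`, `Text`, and `Page` are fields of the returned Block objects.

```python
from collections import defaultdict


def extract_lines(blocks):
    """Return the text of every LINE block, in the order Textract returned them."""
    return [block["Text"] for block in blocks if block.get("BlockType") == "LINE"]


def lines_by_page(blocks):
    """Group LINE text by page number; blocks without a Page field count as page 1."""
    pages = defaultdict(list)
    for block in blocks:
        if block.get("BlockType") == "LINE":
            pages[block.get("Page", 1)].append(block["Text"])
    return dict(pages)
```

Usage, with a hand-written list standing in for a real response:

```python
sample_blocks = [
    {"BlockType": "PAGE", "Page": 1},
    {"BlockType": "LINE", "Page": 1, "Text": "Invoice 42"},
    {"BlockType": "WORD", "Page": 1, "Text": "Invoice"},
    {"BlockType": "WORD", "Page": 1, "Text": "42"},
    {"BlockType": "LINE", "Page": 2, "Text": "Total due"},
]
print(extract_lines(sample_blocks))  # ['Invoice 42', 'Total due']
```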