
I want to extract text from images using Python. (The Tesseract library does not work for me because it requires installation.)

I have found the boto3 library and Amazon Textract, but I'm having trouble working with them. I'm still new to this. Can you tell me what I need to do in order to run my script correctly?

This is my code:

import cv2
import boto3
import textract


#img = cv2.imread('slika2.jpg') #this is jpg file
with open('slika2.pdf', 'rb') as document:
    img = bytearray(document.read())

textract = boto3.client('textract',region_name='us-west-2')

response = textract.detect_document_text(Document={'Bytes': img})  # gives me error
print(response)

When I run this code, I get:

botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the DetectDocumentText operation: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

I have also tried this:

# Document
documentName = "slika2.jpg"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Amazon Textract client
textract = boto3.client('textract',region_name='us-west-2')

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes}) #ERROR

#print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')

But I get this error:

botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the DetectDocumentText operation: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

I'm new to this, so any help would be appreciated. How can I read text from my image or PDF file?

I have also added this block of code, but I still get an error: Unable to locate credentials.

session = boto3.Session(
    aws_access_key_id='xxxxxxxxxxxx',
    aws_secret_access_key='yyyyyyyyyyyyyyyyyyyyy'
)
John Rotenstein
taga
  • See https://stackoverflow.com/questions/33297172/boto3-error-botocore-exceptions-nocredentialserror-unable-to-locate-credential/58431571#58431571, it may help you. As far as I can see, you haven't set an AWS profile. – Avinash Dalvi Sep 24 '20 at 11:38
  • Any help with this: https://stackoverflow.com/questions/64101224/convert-pdf-to-jpg-in-python – taga Sep 28 '20 at 16:11
  • @aviboy2006 Can you tell me what should I add to my code when I set up the AWS profile? – taga Oct 01 '20 at 11:22
  • If you have set a profile, then check my first answer. – Avinash Dalvi Oct 01 '20 at 12:00
  • @aviboy2006 Sorry, but that does not help me. I'm still learning about AWS and Textract. I want to be able to read text from a PDF or image file. I have the code that I wrote above, so if you can, tell me exactly what I need to do: what I should add to my code, what I should remove, etc. – taga Oct 01 '20 at 12:08
  • Maybe let's start from the beginning. Do you have an AWS account? If yes, how do you access it? Have you set up the AWS CLI as shown [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html)? Do you have programmatic keys to access your account? – Marcin Oct 04 '20 at 02:01
  • Yes, I have installed awscli on my Mac and set my region, access key, and secret access key, but when I run the program I get an error saying that my keys are not valid. – taga Oct 06 '20 at 09:57
  • https://github.com/aviboy2006/coding-challenge/blob/master/parse_statement.py try this. – Avinash Dalvi Oct 07 '20 at 06:05

1 Answer


The problem is in how the credentials are passed to boto3. You have to pass the credentials when creating the boto3 client.

import boto3

# boto3 client
client = boto3.client(
    'textract', 
    region_name='us-west-2', 
    aws_access_key_id='xxxxxxx', 
    aws_secret_access_key='xxxxxxx'
)

# Read image
with open('slika2.png', 'rb') as document:
    img = bytearray(document.read())

# Call Amazon Textract
response = client.detect_document_text(
    Document={'Bytes': img}
)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')

Note that hardcoding AWS keys in code is not recommended. Please refer to the following document:

https://boto3.amazonaws.com/v1/documentation/api/1.9.42/guide/configuration.html
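
As a rough sketch of the non-hardcoded approach, the snippet below lets boto3 resolve credentials on its own (from environment variables or from the shared credentials file written by `aws configure`) instead of embedding keys in the script. The helper names `make_textract_client`, `detect_lines`, and `extract_lines` are made up for illustration, and the `us-west-2` region and `slika2.png` file name are simply carried over from the question. It has not been run against the asker's account.

```python
def extract_lines(response):
    """Return the text of every LINE block in a DetectDocumentText response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def make_textract_client(profile_name=None, region_name="us-west-2"):
    """Create a Textract client without hardcoded keys.

    With profile_name=None, boto3 falls back to its normal lookup order
    (environment variables, then the shared credentials file, and so on).
    """
    # Imported here so the pure helper above can be used without boto3 installed.
    import boto3

    session = boto3.Session(profile_name=profile_name, region_name=region_name)
    return session.client("textract")


def detect_lines(image_path, client):
    """Send a local image to Textract and return the detected lines of text."""
    with open(image_path, "rb") as document:
        image_bytes = document.read()
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return extract_lines(response)
```

With credentials configured, usage would look like `detect_lines("slika2.png", make_textract_client())`. Passing the client in as an argument also makes it possible to try the parsing logic with a stand-in object before touching AWS.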

Vipin Kumar
  • I've not tested this with a PDF; please try it and let me know if there is any issue. :) – Vipin Kumar Oct 09 '20 at 16:39
  • It's giving the error. I do not know if I can do it without an S3 bucket. – taga Oct 09 '20 at 18:13
  • Please check this question: https://stackoverflow.com/questions/64261011/using-aws-textract-for-processing-pdf – taga Oct 09 '20 at 18:13
  • Yes, you are right. For PDF, you have to use the asynchronous method with S3. A workaround is to convert the PDF to images and then use Textract. Let me know if you need an example of that. – Vipin Kumar Oct 10 '20 at 04:00
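
For the asynchronous S3 route mentioned in the last comment, a minimal sketch might look like the following. It assumes the PDF has already been uploaded to an S3 bucket that the credentials can read. The function name `detect_pdf_lines` and the `poll_seconds` and `max_polls` parameters are hypothetical. It uses the `start_document_text_detection` and `get_document_text_detection` operations of the boto3 Textract client and polls for completion, which is the simplest option but not necessarily the best one for production use.

```python
import time


def detect_pdf_lines(client, bucket, key, poll_seconds=5.0, max_polls=60):
    """Run asynchronous text detection on a PDF in S3 and return its lines."""
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state or we give up.
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    # Anything other than SUCCEEDED (including partial success) is treated as an error.
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended as {result['JobStatus']}")

    # Results may be paginated; follow NextToken until it is absent.
    lines = []
    while True:
        lines.extend(
            block["Text"]
            for block in result.get("Blocks", [])
            if block["BlockType"] == "LINE"
        )
        token = result.get("NextToken")
        if not token:
            return lines
        result = client.get_document_text_detection(JobId=job_id, NextToken=token)
```

A call would look like `detect_pdf_lines(client, "my-bucket", "slika2.pdf")`, where `client` is a Textract client and `my-bucket` is a placeholder bucket name.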