I am trying to use Amazon Textract to read multiple multi-page PDF files with the StartDocumentTextDetection method, as follows:

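import random

import boto3

s3 = boto3.resource('s3')
client = boto3.client('textract')
textract_bucket = s3.Bucket('my_textract_console-us-east-2')

for s3_file in textract_bucket.objects.all():
    print(s3_file)

    # Start an asynchronous text-detection job for this object
    response = client.start_document_text_detection(
        DocumentLocation={
            "S3Object": {
                "Bucket": "my_textract_console-us-east-2",
                "Name": s3_file.key,
            }
        },
        # randint needs integer bounds, so use 10**10 rather than 1e10
        ClientRequestToken=str(random.randint(1, 10**10)))
    print(response)
    break  # stop after the first object while debugging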

When I just print the object summary returned from S3, I see:

s3.ObjectSummary(bucket_name='my_textract_console-us-east-2', key='C:\\Users\\My_User\\Documents\\Folder\\Sub_Folder\\Sub_sub_folder\\filename.PDF')

Correspondingly, I'm using that s3_file.key to access the object later. But I'm getting the following error that I can't figure out:

InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the StartDocumentTextDetection operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.

So far I have:

  1. Checked the region from the boto3 session: both the bucket and the AWS configuration are set to us-east-2.
  2. Ruled out a wrong key, since I'm passing it directly from the object summary.
  3. Checked permissions in the IAM console: the user has AmazonS3FullAccess and AmazonTextractFullAccess.

What could be going wrong here?

[EDIT] I renamed the files so that the keys no longer contain \\, but it is still not working.

John Rotenstein
ocean800
  • That file key looks like a local file, not an S3 key. – stdunbar Aug 31 '20 at 16:11
  • @stdunbar oh well in this case I used `response = s3_client.upload_file(file_name, bucket, object_name=file_name)` to upload the objects into `s3`, so it shouldn't be an issue. I purposely named the `object_name` to be a filepath – ocean800 Aug 31 '20 at 16:28
  • From the [S3 Object key and metadata](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html) docs, the backslash character is in the "Characters to avoid" section. For testing, does it work without that character? – stdunbar Aug 31 '20 at 16:38
  • @stdunbar yes unfortunately I just tried that and it's still not working – ocean800 Aug 31 '20 at 19:56

1 Answer

I ran into the same issue and solved it by specifying a region when creating the Textract client. In my case I used us-east-2:

client = boto3.client('textract', region_name='us-east-2')

The clue to do so came from this issue: https://github.com/aws/aws-sdk-js/issues/2714
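
If hard-coding the region is undesirable, one option is to look up the bucket's region and create the Textract client there. This is only a sketch: it assumes the error comes from the client and the bucket being in different regions, and the helper names `region_from_location_constraint` and `textract_client_for_bucket` are hypothetical, not part of boto3. It relies on the S3 `get_bucket_location` call, which reports `None` for us-east-1 and the legacy value `EU` for eu-west-1.

```python
def region_from_location_constraint(constraint):
    """Map the LocationConstraint value returned by S3 to a region name."""
    # S3 returns None for buckets in us-east-1 and may return the
    # legacy value "EU" for buckets in eu-west-1.
    if constraint is None:
        return "us-east-1"
    if constraint == "EU":
        return "eu-west-1"
    return constraint


def textract_client_for_bucket(bucket_name):
    """Create a Textract client in the same region as the given bucket.

    Hypothetical helper: requires boto3, valid credentials, and
    permission to call s3:GetBucketLocation on the bucket.
    """
    import boto3  # imported here so the pure helper above works without boto3

    s3_client = boto3.client("s3")
    location = s3_client.get_bucket_location(Bucket=bucket_name)
    region = region_from_location_constraint(location.get("LocationConstraint"))
    return boto3.client("textract", region_name=region)
```

With that in place, `client = textract_client_for_bucket('my_textract_console-us-east-2')` could replace the explicit `region_name` argument, at the cost of one extra S3 call.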

Luis Da Silva