I am using AWS Textract StartDocumentTextDetectionCommand and GetDocumentTextDetectionCommand. I want only lines to be returned, not the single words.

Question

I am creating an internal OCR tool using AWS Textract and Node.js to detect text in a scanned PDF, specifically with StartDocumentTextDetectionCommand and GetDocumentTextDetectionCommand. Currently the result comes back as a list of Block objects, with the lines first and then every individual word. Is there a parameter or some other option I can set so that it returns only the lines and not the individual words in the PDF?

score 1 · Answer 1 · answered Oct 26 '22 at 21:27

I would suggest using the Amazon Textract Textractor library, which can be installed with: pip install amazon-textract-textractor

It makes parsing and using the Textract output much easier than the raw JSON.

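    from textractor import Textractor

    # Uses the AWS credentials from the "default" profile
    extractor = Textractor(profile_name="default")
    document = extractor.detect_document_text('test.png')
    # document.lines holds only the detected lines
    print(document.lines)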

score 0 · Answer 2 · answered Sep 23 '22 at 16:14

No, this is not possible. There are multiple block types, and lines link to their words via relationships.

Is there some reason why you cannot simply select only the block types you are interested in (lines)?
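To illustrate how lines link to words, here is a minimal sketch in Python. It assumes the documented Textract block shape, where a LINE block carries a Relationships entry of type CHILD listing the Ids of its WORD blocks. The helper name words_for_line and the sample IDs are made up for illustration only.

```python
def words_for_line(line_block, blocks_by_id):
    """Return the text of the WORD blocks that a LINE block points to."""
    word_ids = []
    for relationship in line_block.get("Relationships", []):
        # CHILD relationships on a LINE block list the Ids of its words
        if relationship["Type"] == "CHILD":
            word_ids.extend(relationship["Ids"])
    return [blocks_by_id[i]["Text"] for i in word_ids if i in blocks_by_id]


# Hypothetical sample data shaped like a Textract response (IDs are invented)
sample_blocks = [
    {"Id": "line-1", "BlockType": "LINE", "Text": "Hello world",
     "Relationships": [{"Type": "CHILD", "Ids": ["word-1", "word-2"]}]},
    {"Id": "word-1", "BlockType": "WORD", "Text": "Hello"},
    {"Id": "word-2", "BlockType": "WORD", "Text": "world"},
]

blocks_by_id = {block["Id"]: block for block in sample_blocks}
line_blocks = [b for b in sample_blocks if b["BlockType"] == "LINE"]

for line in line_blocks:
    print(line["Text"], words_for_line(line, blocks_by_id))
```

If you only care about the lines, you can ignore the relationships entirely and read the Text field of each LINE block.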

Salaz Numpt

score 0 · Answer 3 · answered Oct 12 '22 at 06:23

The response will always contain both the lines and the words. However, you can iterate over response['Blocks'] and keep only the blocks with BlockType == 'LINE'. For example:

    for block in response["Blocks"]:
        if block["BlockType"] == "LINE":
            print(block)
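For an asynchronous job on a multi-page PDF the results are usually paginated, so the same filter has to be applied to every page of results. Below is a rough sketch, assuming a boto3 Textract client, a job that has already reached the SUCCEEDED status, and a hypothetical helper name collect_line_text. The same NextToken loop should carry over to GetDocumentTextDetectionCommand in the Node.js SDK.

```python
def collect_line_text(client, job_id):
    """Collect the text of every LINE block across all result pages.

    Assumes `client` exposes get_document_text_detection like a boto3
    Textract client, and that the job has already finished successfully.
    """
    lines = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_document_text_detection(**kwargs)

        # Keep only LINE blocks; WORD and PAGE blocks are skipped
        for block in response.get("Blocks", []):
            if block["BlockType"] == "LINE":
                lines.append(block["Text"])

        # A missing NextToken means this was the last page of results
        next_token = response.get("NextToken")
        if not next_token:
            break
    return lines
```

With real credentials the call would look something like collect_line_text(boto3.client("textract"), job_id), where job_id is the JobId returned by the start call.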
