I am building an internal OCR tool with AWS Textract and Node.js to detect text in a scanned PDF, specifically using StartDocumentTextDetectionCommand and GetDocumentTextDetectionCommand. Currently the result comes back as a list of block objects, with the lines first, followed by each individual word. Is there a parameter or some other way to make it return only the lines and not the individual words in the PDF?
3 Answers
1
I would suggest using the Amazon Textract Textractor library (pip install amazon-textract-textractor).
It makes parsing and using the Textract output much easier than the raw JSON.
from textractor import Textractor
extractor = Textractor(profile_name="default")
document = extractor.detect_document_text('test.png')
print(document.lines)

Thomas
- 676
- 3
- 18
0
No, this is not possible. There are multiple block types, and lines link to words via relationships.
Is there some reason why you cannot simply select only the block types you are interested in (lines)?
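To illustrate the point about block types and relationships, here is a minimal sketch. It assumes the blocks follow the usual Textract response shape (Id, BlockType, Text, and Relationships entries of Type "CHILD" with a list of Ids). The helper names line_blocks and child_words and the sample data are hypothetical and written by hand, not real Textract output.

```python
from typing import Any, Dict, List


def line_blocks(blocks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the LINE blocks, preserving the order they arrived in."""
    return [b for b in blocks if b.get("BlockType") == "LINE"]


def child_words(line: Dict[str, Any], blocks: List[Dict[str, Any]]) -> List[str]:
    """Follow a LINE block's CHILD relationships to the text of its WORD blocks."""
    by_id = {b["Id"]: b for b in blocks}
    words: List[str] = []
    for rel in line.get("Relationships", []):
        if rel["Type"] == "CHILD":
            words.extend(by_id[i]["Text"] for i in rel["Ids"] if i in by_id)
    return words


# Hypothetical, hand-written sample in the shape of a Textract "Blocks" list.
sample_blocks = [
    {"Id": "p1", "BlockType": "PAGE"},
    {
        "Id": "l1",
        "BlockType": "LINE",
        "Text": "Hello world",
        "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}],
    },
    {"Id": "w1", "BlockType": "WORD", "Text": "Hello"},
    {"Id": "w2", "BlockType": "WORD", "Text": "world"},
]

lines = line_blocks(sample_blocks)
print([b["Text"] for b in lines])
print(child_words(lines[0], sample_blocks))
```

Since each LINE block already carries the full text of the line, the WORD blocks can usually be ignored unless per-word geometry or confidence is needed.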

Salaz Numpt
- 41
- 4
0
The response will always contain both the lines and the words, but you can iterate over response['Blocks'] and keep only the blocks with BlockType == 'LINE'. For example:
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block)

Jayalekshmi R J
- 51
- 7