
I want to load a medical test to S3, analyse it with AWS Textract, extract tables and send to AWS Comprehend Medical. For some reason it takes around 6-8 seconds to run.

Here is what I have done so far. I would appreciate your advice, or a pointer to a repo with a working solution.

import json
import boto3
import sys

def get_rows_columns_map(table_result, blocks_map):
    rows = {}
    for relationship in table_result['Relationships']:
        if relationship['Type'] == 'CHILD':
            for child_id in relationship['Ids']:
                cell = blocks_map[child_id]
                if cell['BlockType'] == 'CELL':
                    row_index = cell['RowIndex']
                    col_index = cell['ColumnIndex']
                    if row_index not in rows:
                        # create new row
                        rows[row_index] = {}

                    # get the text value
                    rows[row_index][col_index] = get_text(cell, blocks_map)
    return rows


def get_text(result, blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                for child_id in relationship['Ids']:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
                    if word['BlockType'] == 'SELECTION_ELEMENT':
                        if word['SelectionStatus'] =='SELECTED':
                            text +=  'X '    
    return text


def get_table_csv_results(file_name):

    with open(file_name, 'rb') as file:
        img_test = file.read()
        bytes_test = bytearray(img_test)
        #print('Image loaded', file_name)

    # process using image bytes
    # get the results
    client = boto3.client('textract')

    response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])

    # Get the text blocks
    blocks=response['Blocks']
    #pprint(blocks)

    blocks_map = {}
    table_blocks = []
    for block in blocks:
        blocks_map[block['Id']] = block
        if block['BlockType'] == "TABLE":
            table_blocks.append(block)

    if len(table_blocks) <= 0:
        return "<b> NO Table FOUND </b>"

    csv = ''
    for index, table in enumerate(table_blocks):
        csv += generate_table_csv(table, blocks_map, index +1)
        csv += '\n\n'

    return csv

def generate_table_csv(table_result, blocks_map, table_index):
    rows = get_rows_columns_map(table_result, blocks_map)

    table_id = 'Table_' + str(table_index)

    # get cells.
    csv = 'Table: {0}\n\n'.format(table_id)

    for row_index, cols in rows.items():

        for col_index, text in cols.items():
            csv += '{}'.format(text) + ","
        csv += '\n'

    csv += '\n\n\n'
    return csv

def extract_entities(text):
    client = boto3.client(service_name='comprehendmedical')
    result = client.detect_entities_v2(Text=text)
    return result['Entities']

def main(file_name):
    import time
    start_time = time.time()
    table_csv = get_table_csv_results(file_name)
    #print("Entities:")
    entities = extract_entities(table_csv)
    print("--- %s seconds ---" % (time.time() - start_time))
    #output_file = 'output.csv'

    # replace content
    #with open(output_file, "wt") as fout:
    #    fout.write(table_csv)

    # show the results
    #print('CSV OUTPUT FILE: ', output_file)


if __name__ == "__main__":
    file_name = sys.argv[1]
    main(file_name)

Example image: [image of the medical test report]

John Rotenstein
SteveS
  • That's a lot of code to post-process the response. Take a look at [amazon-textract-textractor](https://github.com/aws-samples/amazon-textract-textractor) and [pretty-printer](https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter), which provide built-in methods to get table data in friendly formats such as CSV. – Anjan Biswas Jun 07 '22 at 01:51

2 Answers


Here's an attempt using amazon-textract-textractor and pretty-printer and just a few lines of code to get the table out of your document.

Install the Python libraries

!python -m pip install -q amazon-textract-response-parser --upgrade
!python -m pip install -q amazon-textract-caller --upgrade
!python -m pip install -q amazon-textract-prettyprinter --upgrade

Then you can run the following piece of code to get your table. Note that in my code I am using a Pandas DataFrame, but that is completely optional.

import pandas as pd
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import convert_table_to_list
from trp import Document  # provided by amazon-textract-response-parser

# SO.png is your local document file. The input_document argument can also take an S3 URI.

resp = call_textract(input_document="./SO.png", features=[Textract_Features.TABLES])
tdoc = Document(resp)
dfs = list()

# The loop will look for all pages and all tables in each of the pages in case document is a multi-page PDF file
for page in tdoc.pages:
    for table in page.tables:
        dfs.append(pd.DataFrame(convert_table_to_list(trp_table=table)))

The output looks like this:

[image: the resulting DataFrame]

You can also get this table data in CSV format, like this:

from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_string

textract_json = call_textract(input_document="./SO.png", features=[Textract_Features.TABLES])
print(get_string(textract_json=textract_json,
               table_format=Pretty_Print_Table_Format.csv,
               output_type=[Textract_Pretty_Print.TABLES]))

And you'll get output like this in CSV format:

Test Name ,Result ,Flag ,Reference Range ,Lab 
EEPATIC PUNCTION PANEL ,,,,
"PROTKIN, TOTAL ",6.1 ,,6.1-8.1 9/dL ,XN 
ALBUMIN ,4.3 ,,3.6-5.1 g/d7. ,XN 
GLOBULIN ,1.8 ,LOW ,1.9-3.7 gia (calc) ,EN 
RATIO ,2.4 ,,1.0-2.5 (cale) ,XN 
"BILIRUSIN, POTAL ",0.6 ,,9.2-1.2 ng/d: ,EN 
DIRACT ,0.2 ,,0.2 mg/d: ,xN 
"BILIRUBIN, INDIRICT ",0.4 ,,0.2-1.2 ng/dL (calc) ,EN 
ALKALINE PIOSPEATASI ,61 ,,40-115 U/L ,IN 
AST ,27 ,,10-35 U/L ,EN 
ALT ,19 ,,9-46 0/L ,XN 
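
The cells in the CSV above carry trailing spaces, and names containing a comma are quoted, so splitting on "," by hand would break those rows. Here is a minimal sketch of reading such output back with Python's standard csv module. The helper name `parse_table_csv` and the shortened `sample` string are my own, for illustration only; they are not part of the Textract libraries.

```python
import csv
import io


def parse_table_csv(csv_text):
    """Parse CSV text into a list of rows, stripping whitespace around each cell."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        # Skip blank separator lines (csv.reader yields [] for them)
        if not any(cell.strip() for cell in row):
            continue
        rows.append([cell.strip() for cell in row])
    return rows


# A shortened copy of the output shown above (hypothetical sample for the sketch)
sample = (
    'Test Name ,Result ,Flag ,Reference Range ,Lab \n'
    '"PROTKIN, TOTAL ",6.1 ,,6.1-8.1 9/dL ,XN \n'
    'GLOBULIN ,1.8 ,LOW ,1.9-3.7 gia (calc) ,EN \n'
)

header, *records = parse_table_csv(sample)

# Keep only the rows whose "Flag" column is non-empty
flag_col = header.index("Flag")
flagged = [dict(zip(header, record)) for record in records if record[flag_col]]
print(flagged)
```

Parsing into rows first also makes it easy to send only the columns you care about to Comprehend Medical instead of the whole CSV string.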

If you want both TABLES and FORMS from your document, simply add them to the features and output_type arguments respectively.

from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_string

textract_json = call_textract(input_document="./SO.png", features=[Textract_Features.FORMS, Textract_Features.TABLES])
print(get_string(textract_json=textract_json,
               table_format=Pretty_Print_Table_Format.csv,
               output_type=[Textract_Pretty_Print.TABLES, Textract_Pretty_Print.FORMS]))

For your document, the output is going to be:

Test Name ,Result ,Flag ,Reference Range ,Lab 
EEPATIC PUNCTION PANEL ,,,,
"PROTKIN, TOTAL ",6.1 ,,6.1-8.1 9/dL ,XN 
ALBUMIN ,4.3 ,,3.6-5.1 g/d7. ,XN 
GLOBULIN ,1.8 ,LOW ,1.9-3.7 gia (calc) ,EN 
RATIO ,2.4 ,,1.0-2.5 (cale) ,XN 
"BILIRUSIN, POTAL ",0.6 ,,9.2-1.2 ng/d: ,EN 
DIRACT ,0.2 ,,0.2 mg/d: ,xN 
"BILIRUBIN, INDIRICT ",0.4 ,,0.2-1.2 ng/dL (calc) ,EN 
ALKALINE PIOSPEATASI ,61 ,,40-115 U/L ,IN 
AST ,27 ,,10-35 U/L ,EN 
ALT ,19 ,,9-46 0/L ,XN 

Key,Value
REPORT STATUS:,FINAL
DOB:,
Performing Laboratory Information:,
AGE:,
SPECIMEN:,
Clinical Info:,

I also did a (naive) benchmark using the code above and it seems to be much faster than the 6 to 8 seconds your code is taking, but please test it.
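
If you want to see where the time actually goes, it may help to time the Textract call and the Comprehend Medical call separately rather than the whole script. Below is a rough sketch of that idea. The `timed` helper and the two `fake_*` functions are hypothetical stand-ins so the snippet runs without AWS credentials; replace them with your real calls.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label, store=timings):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        store[label] = time.perf_counter() - start


# Stand-ins for the real service calls (assumptions, not AWS APIs)
def fake_textract_call():
    time.sleep(0.02)
    return "Test Name,Result\nAST,27\n"


def fake_comprehend_call(text):
    time.sleep(0.01)
    return []


with timed("textract"):
    table_csv = fake_textract_call()

with timed("comprehend_medical"):
    entities = fake_comprehend_call(table_csv)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Knowing which stage dominates tells you whether the table extraction or the entity detection is worth optimising first.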


Anjan Biswas

You need to import the time module. Generally, what happens here is that the next task for another document starts while one is still being processed.

import time
import boto3

textract = boto3.client("textract")

def isJobComplete(jobId):
    # For production use cases, use SNS based notification
    # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
    time.sleep(5)
    response = textract.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))

    # Keep polling every 5 seconds until the job is no longer in progress
    while status == "IN_PROGRESS":
        time.sleep(5)
        response = textract.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status

Try using an asynchronous job and have the program sleep while the job runs. Also, increase the execution timeout in the Lambda configuration.
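
The loop above waits forever if the job never leaves IN_PROGRESS, which is risky inside a Lambda with a fixed timeout. Here is a small sketch of the same polling idea with an upper bound on the wait. The names `wait_for_job` and `FakeTextract` are hypothetical, made up for this example; the fake client only imitates the `get_document_text_detection` call and the `JobStatus` field used above, so the snippet runs without AWS access.

```python
import time


def wait_for_job(client, job_id, poll_seconds=5, timeout_seconds=300, sleep=time.sleep):
    """Poll until the job leaves IN_PROGRESS; raise TimeoutError if it takes too long."""
    waited = 0
    while True:
        response = client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job {job_id} still in progress after {waited} s")
        sleep(poll_seconds)
        waited += poll_seconds


class FakeTextract:
    """Stand-in client that returns a scripted sequence of job statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_document_text_detection(self, JobId):
        return {"JobStatus": next(self._statuses)}


fake = FakeTextract(["IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"])
# sleep is replaced with a no-op so the demo finishes instantly
status = wait_for_job(fake, "job-123", sleep=lambda seconds: None)
print(status)
```

With a real client you would pass `boto3.client("textract")` and keep the default `sleep`. For production, the SNS notification mentioned in the comment above avoids polling altogether.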