Parse medical tests and extract tables and key values using Python and AWS?

Question

I want to load a medical test to S3, analyse it with AWS Textract, extract tables and send to AWS Comprehend Medical. For some reason it takes around 6-8 seconds to run.

Here is what I have done so far. I would appreciate your advice, or a pointer to a repo with a working solution.

import json
import boto3
import sys

def get_rows_columns_map(table_result, blocks_map):
    rows = {}
    for relationship in table_result['Relationships']:
        if relationship['Type'] == 'CHILD':
            for child_id in relationship['Ids']:
                cell = blocks_map[child_id]
                if cell['BlockType'] == 'CELL':
                    row_index = cell['RowIndex']
                    col_index = cell['ColumnIndex']
                    if row_index not in rows:
                        # create new row
                        rows[row_index] = {}

                    # get the text value
                    rows[row_index][col_index] = get_text(cell, blocks_map)
    return rows


def get_text(result, blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                for child_id in relationship['Ids']:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
                    elif word['BlockType'] == 'SELECTION_ELEMENT':
                        if word['SelectionStatus'] == 'SELECTED':
                            text += 'X '
    return text


def get_table_csv_results(file_name):

    with open(file_name, 'rb') as file:
        img_test = file.read()
        bytes_test = bytearray(img_test)
        #print('Image loaded', file_name)

    # process using image bytes
    # get the results
    client = boto3.client('textract')

    response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])

    # Get the text blocks
    blocks=response['Blocks']
    #pprint(blocks)

    blocks_map = {}
    table_blocks = []
    for block in blocks:
        blocks_map[block['Id']] = block
        if block['BlockType'] == "TABLE":
            table_blocks.append(block)

    if len(table_blocks) <= 0:
        return "<b> NO Table FOUND </b>"

    csv = ''
    for index, table in enumerate(table_blocks):
        csv += generate_table_csv(table, blocks_map, index +1)
        csv += '\n\n'

    return csv

def generate_table_csv(table_result, blocks_map, table_index):
    rows = get_rows_columns_map(table_result, blocks_map)

    table_id = 'Table_' + str(table_index)

    # get cells.
    csv = 'Table: {0}\n\n'.format(table_id)

    for row_index, cols in rows.items():

        for col_index, text in cols.items():
            csv += '{}'.format(text) + ","
        csv += '\n'

    csv += '\n\n\n'
    return csv

def extract_entities(text):
    client = boto3.client(service_name='comprehendmedical')
    result = client.detect_entities_v2(Text=text)
    return result['Entities']

def main(file_name):
    import time
    start_time = time.time()
    table_csv = get_table_csv_results(file_name)
    #print("Entities:")
    entities = extract_entities(table_csv)
    print("--- %s seconds ---" % (time.time() - start_time))
    #output_file = 'output.csv'

    # replace content
    #with open(output_file, "wt") as fout:
    #    fout.write(table_csv)

    # show the results
    #print('CSV OUTPUT FILE: ', output_file)


if __name__ == "__main__":
    file_name = sys.argv[1]
    main(file_name)

Example image:

That's a lot of code to post-process the response. Take a look at [amazon-textract-textractor](https://github.com/aws-samples/amazon-textract-textractor) and [pretty-printer](https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter), which provide built-in methods to get table data in friendly formats (such as CSV). - Anjan Biswas, Jun 07 '22 at 01:51

Anjan Biswas · Answer 1 · 2022-06-07T02:13:25.080

Here's an attempt using amazon-textract-textractor and pretty-printer and just a few lines of code to get the table out of your document.

Install the Python libraries

!python -m pip install -q amazon-textract-response-parser --upgrade
!python -m pip install -q amazon-textract-caller --upgrade
!python -m pip install -q amazon-textract-prettyprinter --upgrade

Then you can run the following piece of code to get your table. Note that in my code I am using a Pandas DataFrame, but that is completely optional.

import pandas as pd
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import convert_table_to_list
from trp import Document  # provided by amazon-textract-response-parser

# SO.png is your local document file. The input_document argument can also take an S3 URI.

resp = call_textract(input_document="./SO.png", features=[Textract_Features.TABLES])
tdoc = Document(resp)
dfs = list()

# The loop visits every page and every table on each page, in case the document is a multi-page PDF file
for page in tdoc.pages:
    for table in page.tables:
        dfs.append(pd.DataFrame(convert_table_to_list(trp_table=table)))

The output looks like this: [image: the extracted table as a DataFrame]

You can also get this table data in CSV format, like this:

from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_string

textract_json = call_textract(input_document="./SO.png", features=[Textract_Features.TABLES])
print(get_string(textract_json=textract_json,
               table_format=Pretty_Print_Table_Format.csv,
               output_type=[Textract_Pretty_Print.TABLES]))

And you'll get output like this in CSV format:

Test Name ,Result ,Flag ,Reference Range ,Lab 
EEPATIC PUNCTION PANEL ,,,,
"PROTKIN, TOTAL ",6.1 ,,6.1-8.1 9/dL ,XN 
ALBUMIN ,4.3 ,,3.6-5.1 g/d7. ,XN 
GLOBULIN ,1.8 ,LOW ,1.9-3.7 gia (calc) ,EN 
RATIO ,2.4 ,,1.0-2.5 (cale) ,XN 
"BILIRUSIN, POTAL ",0.6 ,,9.2-1.2 ng/d: ,EN 
DIRACT ,0.2 ,,0.2 mg/d: ,xN 
"BILIRUBIN, INDIRICT ",0.4 ,,0.2-1.2 ng/dL (calc) ,EN 
ALKALINE PIOSPEATASI ,61 ,,40-115 U/L ,IN 
AST ,27 ,,10-35 U/L ,EN 
ALT ,19 ,,9-46 0/L ,XN
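
If you need the rows as Python objects rather than a printed string, the CSV text can be read back with the standard library. This is only a minimal sketch: `parse_table_csv` is a hypothetical helper (not part of the Textract libraries), and it assumes the first row of the table is the header. It strips the trailing space that appears in each cell and relies on the `csv` module to handle quoted cells that contain commas, such as the "PROTKIN, TOTAL " row.

```python
import csv
import io


def parse_table_csv(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row.

    Hypothetical helper: assumes the first non-empty row is the header.
    """
    reader = csv.reader(io.StringIO(csv_text))
    # Strip the trailing spaces found in each cell and skip blank lines.
    rows = [[cell.strip() for cell in row] for row in reader if row]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


# Three lines copied from the output above.
sample = (
    'Test Name ,Result ,Flag ,Reference Range ,Lab \n'
    'ALBUMIN ,4.3 ,,3.6-5.1 g/d7. ,XN \n'
    '"PROTKIN, TOTAL ",6.1 ,,6.1-8.1 9/dL ,XN \n'
)
records = parse_table_csv(sample)
print(records[1]["Test Name"], records[1]["Result"])
```

Each record is a plain dict, so it is easy to filter (for example on a non-empty Flag) before sending anything on to Comprehend Medical.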

If you want both TABLES and FORMS from your document, simply add them to the features and output_type arguments respectively.

from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_string

textract_json = call_textract(input_document="./SO.png", features=[Textract_Features.FORMS, Textract_Features.TABLES])
print(get_string(textract_json=textract_json,
               table_format=Pretty_Print_Table_Format.csv,
               output_type=[Textract_Pretty_Print.TABLES, Textract_Pretty_Print.FORMS]))

For your document, the output will be:

Test Name ,Result ,Flag ,Reference Range ,Lab 
EEPATIC PUNCTION PANEL ,,,,
"PROTKIN, TOTAL ",6.1 ,,6.1-8.1 9/dL ,XN 
ALBUMIN ,4.3 ,,3.6-5.1 g/d7. ,XN 
GLOBULIN ,1.8 ,LOW ,1.9-3.7 gia (calc) ,EN 
RATIO ,2.4 ,,1.0-2.5 (cale) ,XN 
"BILIRUSIN, POTAL ",0.6 ,,9.2-1.2 ng/d: ,EN 
DIRACT ,0.2 ,,0.2 mg/d: ,xN 
"BILIRUBIN, INDIRICT ",0.4 ,,0.2-1.2 ng/dL (calc) ,EN 
ALKALINE PIOSPEATASI ,61 ,,40-115 U/L ,IN 
AST ,27 ,,10-35 U/L ,EN 
ALT ,19 ,,9-46 0/L ,XN 

Key,Value
REPORT STATUS:,FINAL
DOB:,
Performing Laboratory Information:,
AGE:,
SPECIMEN:,
Clinical Info:,

I also did a (naive) benchmark using the code above and it seems to be much faster than the 6 to 8 seconds your code is taking, but please test it.
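
To repeat that kind of comparison yourself, a small timing wrapper is enough. The sketch below is an assumption about how one might measure it, not the benchmark I ran: `time_call` and `fake_textract_call` are hypothetical names, and the fake function only stands in for the real request so the example runs without AWS credentials. Keep in mind that with the real service every repeat is another billed Textract call.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Run func several times and return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


# Stand-in for the real request (hypothetical); replace with call_textract
# or your own function when measuring against AWS.
def fake_textract_call(path):
    time.sleep(0.01)
    return {"Blocks": [], "source": path}


seconds, response = time_call(fake_textract_call, "./SO.png", repeats=2)
print("best of 2 runs: %.3f seconds" % seconds)
```

Taking the best of a few runs reduces the effect of one-off costs such as creating the boto3 client or a cold network connection, which can dominate a single measurement.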

score 0 · Answer 2 · answered May 30 '22 at 14:59

You need to import the time module. What generally happens here is that the program starts the next task for another document while one is still being processed.

import time

import boto3

textract = boto3.client('textract')

def isJobComplete(jobId):
    # For production use cases, use SNS based notification
    # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
    time.sleep(5)
    response = textract.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))

    # Poll every 5 seconds until the job leaves the IN_PROGRESS state
    while status == "IN_PROGRESS":
        time.sleep(5)
        response = textract.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status

Try using a job like this, which makes the program sleep until processing has finished. Also, increase the timeout in the Lambda function's configuration.
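
Inside Lambda an unbounded polling loop can run until the function itself times out, so it may help to give the wait its own deadline. The following is a rough sketch under that assumption: `wait_for_job` and its parameters are hypothetical names, and the status source is passed in as a function so the example runs offline with a fake one.

```python
import time


def wait_for_job(get_status, job_id, poll_seconds=5.0, timeout_seconds=300.0):
    """Poll get_status(job_id) until the job leaves IN_PROGRESS or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status(job_id)
        if status != "IN_PROGRESS":
            return status
        # Give up if the next sleep would take us past the deadline.
        if time.monotonic() + poll_seconds > deadline:
            raise TimeoutError(
                "job %s still IN_PROGRESS after %.0f s" % (job_id, timeout_seconds)
            )
        time.sleep(poll_seconds)


# Fake status source so the sketch runs offline: two polls in progress, then done.
_statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"])
final = wait_for_job(
    lambda job_id: next(_statuses), "job-123", poll_seconds=0.01, timeout_seconds=1.0
)
print(final)
```

With Textract, the status function could be something like `lambda job_id: textract.get_document_text_detection(JobId=job_id)["JobStatus"]`, with `timeout_seconds` kept below the Lambda timeout.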
