
I am using AWS Comprehend for PII redaction. The idea is to detect entities and then redact the PII from the text.

Now the problem is that this API has an input text size limit. How can I increase the limit, maybe to 1 MB? Or is there another way to detect entities in large text?

ERROR: botocore.errorfactory.TextSizeLimitExceededException: An error occurred (TextSizeLimitExceededException) when calling the DetectPiiEntities operation: Input text size exceeds limit. Max length of request text allowed is 5000 bytes while in this request the text size is 7776 bytes

Kush Verma

2 Answers


There's no way to increase this limit. For input text larger than 5000 bytes, split the text into chunks of at most 5000 bytes each, call the API on each chunk, and then aggregate the results. Make sure to keep some overlap between chunks so that context carries over from one chunk to the next.

For reference, you can use the similar solution published by the Comprehend team itself: https://github.com/aws-samples/amazon-comprehend-s3-object-lambda-functions/blob/main/src/processors.py#L172
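The chunk-with-overlap approach above can be sketched as follows. This is a hypothetical illustration, not the AWS sample code: `chunk_text` and `detect_pii_over_chunks` are names I made up, the chunk and overlap sizes are arbitrary, and it assumes the `BeginOffset`/`EndOffset` values returned by `DetectPiiEntities` are character-based so they can be shifted by each chunk's start position.

    import boto3

    def chunk_text(text, max_bytes=4500, overlap=500):
        """Split `text` into overlapping chunks, each at most `max_bytes`
        bytes when UTF-8 encoded. Returns (start_offset, chunk) pairs."""
        chunks = []
        start = 0
        while start < len(text):
            end = start + max_bytes
            # Shrink the chunk until its UTF-8 encoding fits the byte limit
            # (multi-byte characters can push it over).
            while len(text[start:end].encode('utf-8')) > max_bytes:
                end -= 1
            chunks.append((start, text[start:end]))
            if end >= len(text):
                break
            start = end - overlap  # carry context over from the previous chunk
        return chunks

    def detect_pii_over_chunks(client, text):
        """Call DetectPiiEntities per chunk and map offsets back to `text`.
        Entities found twice in an overlap region are deduplicated by
        (Type, BeginOffset, EndOffset)."""
        seen = set()
        entities = []
        for start, chunk in chunk_text(text):
            resp = client.detect_pii_entities(Text=chunk, LanguageCode='en')
            for e in resp['Entities']:
                key = (e['Type'], e['BeginOffset'] + start, e['EndOffset'] + start)
                if key not in seen:
                    seen.add(key)
                    entities.append({**e,
                                     'BeginOffset': e['BeginOffset'] + start,
                                     'EndOffset': e['EndOffset'] + start})
        return entities

    # client = boto3.client('comprehend', region_name='us-east-1')
    # entities = detect_pii_over_chunks(client, large_text)

Note that deduplication by exact offsets is the simplest possible policy; the AWS sample linked above handles segment merging more carefully.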


There isn't a straightforward example, so here is one using the official segmenter/de-segmenter from AWS: https://github.com/aws-samples/amazon-comprehend-s3-object-lambda-functions/blob/main/src/processors.py#L172. I converted the code at that link into a package and then imported the Segmenter. I had to adjust the imports in all of the Python scripts.


Instructions to replicate code below:

  1. Go to the link and download the entire src folder
  2. Rename the folder "comprehend_utils"
  3. Fix the imports in all files until the code below runs
    from comprehend_utils.processors import Segmenter
    import boto3
    import os
    import pandas as pd

    # Credentials left blank intentionally -- fill in your own
    os.environ['AWS_SECRET_ACCESS_KEY'] = ''
    os.environ['AWS_ACCESS_KEY_ID'] = ''

    def get_results(text):
        # Note: this example calls Comprehend Medical; for plain PII use
        # service_name='comprehend' and client.detect_pii_entities instead
        client = boto3.client(service_name='comprehendmedical', region_name='us-east-1')
        result = client.detect_entities(Text=text)
        entities = result['Entities']
        for entity in entities:
            print('Entity', entity)
        return entities

    # Read the input text file
    file_name = 'yourfile.txt'
    with open(file_name, 'rb') as f:
        text = f.read()
    text = text.decode('utf-8')

    # Split the text into segments small enough for the API
    segmenter = Segmenter(2000)
    document_list = segmenter.segment(text)

    # Detect entities for each segment
    for r in document_list:
        entities = get_results(r.text)
        r.pii_entities = entities

    # Merge the per-segment results back into one document
    final_output = segmenter.de_segment(document_list)

    df = pd.DataFrame(final_output.pii_entities)
    df.to_csv(f'{file_name}_output.csv')
grantr
  • Can you share the package link with the correct imports? – Manas shukla May 18 '23 at 09:29
  • @Manasshukla it's only on my desktop right now. You could download the code from the link and manually fix the imports. I edited the post with instructions. – grantr May 18 '23 at 14:18