
I am using AWS Comprehend for PII redaction. The idea is to detect entities and then redact the PII from the text.

Now the problem is that this API has an input text size limit. How can I increase the limit, maybe to 1 MB? Or is there another way to detect entities in large text?

ERROR: botocore.errorfactory.TextSizeLimitExceededException: An error occurred (TextSizeLimitExceededException) when calling the DetectPiiEntities operation: Input text size exceeds limit. Max length of request text allowed is 5000 bytes while in this request the text size is 7776 bytes

Kush Verma

2 Answers


There's no way to increase this limit. For input text larger than 5000 bytes, split the text into chunks of at most 5000 bytes each, call the API on each chunk, and then aggregate the results. Make sure to keep some overlap between chunks so that context carries over from one chunk to the next.

For reference, you can use the similar solution published by the Comprehend team itself: https://github.com/aws-samples/amazon-comprehend-s3-object-lambda-functions/blob/main/src/processors.py#L172
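The chunk-with-overlap approach above can be sketched as follows. This is a hypothetical illustration, not the AWS sample code: `chunk_text` and `detect_pii_over_chunks` are names I made up, the chunk and overlap sizes are arbitrary, and it assumes the `BeginOffset`/`EndOffset` values returned by `DetectPiiEntities` are character-based so they can be shifted by each chunk's start position.

    import boto3

    def chunk_text(text, max_bytes=4500, overlap=500):
        """Split `text` into overlapping chunks, each at most `max_bytes`
        bytes when UTF-8 encoded. Returns (start_offset, chunk) pairs."""
        chunks = []
        start = 0
        while start < len(text):
            end = start + max_bytes
            # Shrink the chunk until its UTF-8 encoding fits the byte limit
            # (multi-byte characters can push it over).
            while len(text[start:end].encode('utf-8')) > max_bytes:
                end -= 1
            chunks.append((start, text[start:end]))
            if end >= len(text):
                break
            start = end - overlap  # carry context over from the previous chunk
        return chunks

    def detect_pii_over_chunks(client, text):
        """Call DetectPiiEntities per chunk and map offsets back to `text`.
        Entities found twice in an overlap region are deduplicated by
        (Type, BeginOffset, EndOffset)."""
        seen = set()
        entities = []
        for start, chunk in chunk_text(text):
            resp = client.detect_pii_entities(Text=chunk, LanguageCode='en')
            for e in resp['Entities']:
                key = (e['Type'], e['BeginOffset'] + start, e['EndOffset'] + start)
                if key not in seen:
                    seen.add(key)
                    entities.append({**e,
                                     'BeginOffset': e['BeginOffset'] + start,
                                     'EndOffset': e['EndOffset'] + start})
        return entities

    # client = boto3.client('comprehend', region_name='us-east-1')
    # entities = detect_pii_over_chunks(client, large_text)

Note that deduplication by exact offsets is the simplest possible policy; the AWS sample linked above handles segment merging more carefully.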


There isn't a straightforward example, so here is one using the official segmenter/de-segmenter from AWS: https://github.com/aws-samples/amazon-comprehend-s3-object-lambda-functions/blob/main/src/processors.py#L172. I converted the code at that link into a package and then imported the Segmenter. I had to adjust the imports in all of the Python scripts.


Instructions to replicate code below:

  1. Go to the link and download the entire src folder
  2. Rename the folder "comprehend_utils"
  3. Fix the imports in all files until the code below runs
    from comprehend_utils.processors import Segmenter
    import boto3
    import os
    import pandas as pd

    # Credentials left blank intentionally -- fill in your own
    os.environ['AWS_SECRET_ACCESS_KEY'] = ''
    os.environ['AWS_ACCESS_KEY_ID'] = ''

    def get_results(text):
        # Note: this example calls Comprehend Medical; for plain PII use
        # service_name='comprehend' and client.detect_pii_entities instead
        client = boto3.client(service_name='comprehendmedical', region_name='us-east-1')
        result = client.detect_entities(Text=text)
        entities = result['Entities']
        for entity in entities:
            print('Entity', entity)
        return entities

    # Read the input text file
    file_name = 'yourfile.txt'
    with open(file_name, 'rb') as f:
        text = f.read()
    text = text.decode('utf-8')

    # Split the text into segments small enough for the API
    segmenter = Segmenter(2000)
    document_list = segmenter.segment(text)

    # Detect entities for each segment
    for r in document_list:
        entities = get_results(r.text)
        r.pii_entities = entities

    # Merge the per-segment results back into one document
    final_output = segmenter.de_segment(document_list)

    df = pd.DataFrame(final_output.pii_entities)
    df.to_csv(f'{file_name}_output.csv')
grantr
  • Can you share the package link with the correct imports? – Manas shukla May 18 '23 at 09:29
  • @Manasshukla it's only on my desktop right now. You could download the code from the link and manually fix the imports. I edited the post with instructions. – grantr May 18 '23 at 14:18