I am automating an AWS Textract flow: files get uploaded to S3 by an app (which I have already built), a Lambda function gets triggered, extracts the forms as a CSV, and saves it in the same bucket.
I started with a simple Textract call that extracts all the text in the image and saves the result as a .txt file. Below is my code:
import os
import urllib.parse

import boto3

# Clients used by the functions below
s3 = boto3.client('s3')
textract = boto3.client('textract')

def InvokeTextract(bucketName, documentKey):
    print('Loading InvokeTextract')
    # Call Amazon Textract
    response = textract.detect_document_text(
        Document={
            'S3Object': {
                'Bucket': bucketName,
                'Name': documentKey
            }
        })
    Textractoutput = ''
    # Collect the detected text, one LINE block per row
    for item in response['Blocks']:
        if item['BlockType'] == 'LINE':
            Textractoutput += item['Text'] + '\n'
    return Textractoutput
def writeOutputToS3Bucket(textractData, bucketName, createdS3Document):
    print('Loading writeOutputToS3Bucket')
    generateFilePath = os.path.splitext(createdS3Document)[0] + '.txt'
    s3.put_object(Body=textractData, Bucket=bucketName, Key=generateFilePath)
    print('Generated ' + generateFilePath)
def lambda_handler(event, context):
    # Get the bucket name and object key from the S3 event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:
        Textractoutput = InvokeTextract(bucket, key)
        writeOutputToS3Bucket(Textractoutput, bucket, key)
        return 'Processed'
    except Exception as e:
        # Log the error and re-raise so the failed invocation is visible
        print(e)
        raise
This works fine, but it isn't helpful if I want key-value pairs. So I tried different code that produces a CSV. Running it from my local drive, I was able to do that. Below is part of my code:
import csv
from trp import Document  # local module

doc = Document(response)  # response is the Textract output
with open('aws_doc.csv', mode='w', newline='') as aws_field_file:
    field_write = csv.writer(aws_field_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    field_write.writerow(["Key", "Value"])
    for page in doc.pages:
        for field in page.form.fields:
            # Write each field as <key>, <value>
            field_write.writerow([field.key, field.value])
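The same rows can also be built as a string in memory, which makes the CSV step easy to check without a file on disk. This is only a minimal sketch, assuming the form fields are already available as plain (key, value) tuples; `pairs_to_csv_text` and the sample data are hypothetical names used for illustration, not part of trp.

```python
import csv
import io


def pairs_to_csv_text(pairs):
    """Return CSV text with a Key/Value header followed by one row per pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["Key", "Value"])
    for key, value in pairs:
        writer.writerow([key, value])
    return buffer.getvalue()


# Hypothetical pairs standing in for (field.key, field.value).
csv_text = pairs_to_csv_text([
    ("Name", "Jane Doe"),
    ("Address", "1 Main St, Springfield"),
])
print(csv_text)
```

A value that contains a comma is quoted automatically because of `QUOTE_MINIMAL`. The resulting string could then be handed to `s3.put_object` as `Body`, the same way the .txt version above passes its text.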
But when I try to do the same thing in Lambda, I don't get the result (i.e. a CSV file in my bucket). I read about it and found that I need to create a tmp file, but that part was a bit unclear to me. I went with the code below:
def lambda_handler(event, context):
    # Get the object from the event and show its content type
    bucketName = event['Records'][0]['s3']['bucket']['name']
    documentKey = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    # S3 client
    s3 = boto3.resource('s3')
    # Amazon Textract client
    textract = boto3.client('textract')
    # Get AWS Textract Response for Forms
    response = textract.analyze_document(
        Document={
            'S3Object': {
                'Bucket': bucketName,
                'Name': documentKey
            }
        },
        FeatureTypes=["FORMS"])
    # Using custom trp module
    doc = Document(response)
    import csv
    temp_csv_file = csv.writer(open("/tmp/csv_file.csv", "w+"))
    temp_csv_file.writerow(["Key", "Value"])
    for page in doc.pages:
        for field in page.form.fields:
            # This will write it as your <key>, <value>
            temp_csv_file.writerow([field.key, field.value])
    bucketName.upload_file('/tmp/csv_file.csv', 'textractData.csv')
Is my code correct? Am I missing a step in there?
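For reference, here is a minimal sketch of the tmp-file step in isolation, assuming the form fields have already been reduced to plain (key, value) tuples. The helper name `write_pairs_to_csv` and the sample pairs are hypothetical and not part of trp or boto3. The point it illustrates is that the `with` block closes the file, so every row is flushed to disk before anything tries to upload it.

```python
import csv
import os
import tempfile


def write_pairs_to_csv(pairs, path):
    """Write (key, value) tuples to a CSV file at `path` and return the path."""
    # newline='' lets the csv module control line endings itself.
    with open(path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["Key", "Value"])
        for key, value in pairs:
            writer.writerow([key, value])
    # The file is closed at this point, so its contents are fully written.
    return path


# Hypothetical pairs standing in for (field.key, field.value).
sample_pairs = [("Name", "Jane Doe"), ("Date", "2020-01-01")]

# tempfile.gettempdir() typically resolves to /tmp on Linux.
csv_path = write_pairs_to_csv(
    sample_pairs, os.path.join(tempfile.gettempdir(), "csv_file.csv")
)

# In Lambda, the upload would then go through a Bucket object rather than
# the bucket name string (assumes s3 = boto3.resource('s3') as in the handler):
#   s3.Bucket(bucketName).upload_file(csv_path, 'textractData.csv')
```

Compared with the handler above, the two differences are that the file is explicitly closed before the upload, and that `upload_file` is called on an S3 `Bucket` resource, since `bucketName` itself is just a string.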