I have thousands of survey forms that I need to scan and upload to my C# system so that I can extract the data and enter it into a database. The surveys mix two kinds of hand-written input: 1) text boxes and 2) checkboxes. I am currently using the Azure Read API to extract the hand-written text, which should work fine, e.g. question #4 below returns 'Python' and 'coding'.

So my question: can any AWS Textract API tell me which checkbox is marked? E.g. for question #1 below, I need a string back saying 'disagree'. Is this possible with any AWS Textract API?

Unfortunately, Azure Read API and Google Vision OCR do not offer this functionality, so if AWS Textract doesn't help me with this I will have to do something manual, like checking changes in pixel color to detect ticked checkboxes.

Survey type: (image of the scanned survey form)

darego101

1 Answer

Yes, Amazon Textract supports detection of form inputs such as checkboxes and radio buttons, which it reports as selection elements. You can read more about the details in the Amazon Textract docs.

I wrote a quick script to call Textract on your image with the following code. It correctly identified the keys and values of the different form fields, and also whether each selection field was selected or unselected.

# python 3
import boto3

# instantiate client (assumes AWS credentials and a region are already configured)
textract = boto3.client('textract')

# read image bytes
with open("textract-test.png", "rb") as image:
    image_data = image.read()

# call textract endpoint, requesting form (key-value and checkbox) analysis,
# and keep the response so the blocks can be parsed afterwards
response = textract.analyze_document(
    Document={'Bytes': image_data},
    FeatureTypes=['FORMS'],
)

The response contains a list of "blocks", each representing a piece of text or a form input. Parsing this JSON, we can find blocks for checked boxes that resemble the following excerpt:

"Id": "0abb6f4e-4512-4581-b261-a45f2426973f",
      "SelectionStatus": "SELECTED" // value of interest. Alternatively, "NOT_SELECTED"
    },
    {
      "BlockType": "SELECTION_ELEMENT",
      "Confidence": 54.00064468383789,
      "Geometry": {
        "BoundingBox": {
          "Width": 0.030619779601693153,
          "Height": 0.024501724168658257,
          "Left": 0.4210366904735565,
          "Top": 0.439885675907135
        },
        "Polygon": [
          {
            "X": 0.4210366904735565,
            "Y": 0.439885675907135
          },
          {
            "X": 0.4516564607620239,
            "Y": 0.439885675907135
          },
          {
            "X": 0.4516564607620239,
            "Y": 0.4643873870372772
          },
          {
            "X": 0.4210366904735565,
            "Y": 0.4643873870372772
          }
        ]
      },
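
To turn those blocks into something like the 'disagree' string asked for, one option is to walk the relationships between blocks. The sketch below is a minimal, unofficial example: it assumes each checkbox label comes back as a KEY block whose VALUE block has a SELECTION_ELEMENT child, which is how Textract reports form fields as far as I know, but real survey scans may be structured differently. The helper names (`block_text`, `extract_form_fields`, `selected_options`) and the hand-made `sample_response` are hypothetical and only there so the code runs without calling AWS.

```python
def block_text(block, blocks_by_id):
    """Join a block's CHILD words; a selection element contributes its status."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def extract_form_fields(response):
    """Map each form key's text to its value text or selection status."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    fields = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(block_text(blocks_by_id[value_id], blocks_by_id))
        fields[block_text(block, blocks_by_id)] = " ".join(value_parts)
    return fields


def selected_options(response):
    """Return the labels of all fields whose checkbox is marked."""
    return [k for k, v in extract_form_fields(response).items() if v == "SELECTED"]


# Hypothetical, hand-made response with two checkbox fields (not real Textract output).
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["s1"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "agree"},
        {"Id": "s1", "BlockType": "SELECTION_ELEMENT", "SelectionStatus": "NOT_SELECTED"},
        {"Id": "k2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v2"]},
                           {"Type": "CHILD", "Ids": ["w2"]}]},
        {"Id": "v2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["s2"]}]},
        {"Id": "w2", "BlockType": "WORD", "Text": "disagree"},
        {"Id": "s2", "BlockType": "SELECTION_ELEMENT", "SelectionStatus": "SELECTED"},
    ]
}

print(selected_options(sample_response))
```

With a real call you would pass in the dictionary returned by `analyze_document` instead of `sample_response`. It is probably worth checking each block's `Confidence` as well, since the excerpt above shows a selection element at only about 54%.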

Apologies for not whipping up an example in C#, but you can leverage Textract via the CLI or the AWS .NET SDK for similar effects.
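
For the CLI route, here is a rough sketch, kept in Python to match the script above, of building and running the `aws textract analyze-document` command as a subprocess. It assumes the AWS CLI is installed and configured and that the scan has already been uploaded to S3. The bucket and object names are made up, and `build_analyze_command` and `run_analyze` are hypothetical helper names.

```python
import json
import shlex
import subprocess


def build_analyze_command(bucket, key):
    """Build the AWS CLI argument list for analysing a document stored in S3."""
    document = json.dumps({"S3Object": {"Bucket": bucket, "Name": key}})
    return ["aws", "textract", "analyze-document",
            "--document", document,
            "--feature-types", "FORMS",
            "--output", "json"]


def run_analyze(bucket, key):
    """Run the CLI and parse its JSON output (needs the AWS CLI and credentials)."""
    result = subprocess.run(build_analyze_command(bucket, key),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Only print the command here; run_analyze would actually call AWS.
# "my-survey-bucket" and the object name are placeholder values.
print(shlex.join(build_analyze_command("my-survey-bucket", "textract-test.png")))
```

The same idea should carry over to C#: start the `aws` process with these arguments, read its standard output, and deserialize the JSON.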


Note: If you're looking to just get a feel for what response Amazon Textract will return for your data, you can navigate to the Amazon Textract page in the AWS Management Console and use the image test application in there. You can use the GUI to visualize some of the results, or download the API responses in their entirety.

Nick Walsh
  • Many thanks for the quick and in-depth response! I am going to look into getting an example working in C# to see how accurate the results will be. Regarding handwriting: could Textract also extract the hand-written text from my boxes (e.g. 'Python' and 'coding' in question 4 above)? That would save me from using the Azure Read API as well – darego101 Nov 16 '19 at 01:25
  • Yup! It is all available as results under the analyze document method. – Nick Walsh Nov 16 '19 at 01:27
  • fantastic! thanks a lot Nick. it looks like AWS beats Azure and Google yet again! :) – darego101 Nov 16 '19 at 01:28
  • unfortunately it looks like the AWS Textract API is not available at the minute. https://stackoverflow.com/questions/58922839/other-options-for-aws-textract-net-sdk – darego101 Nov 18 '19 at 21:25
  • @darego101 The Textract API is definitely available: your two options for using it are to run a [subprocess call in C#](https://stackoverflow.com/questions/1469764/run-command-prompt-commands) to the [AWS CLI](https://docs.aws.amazon.com/cli/latest/reference/transcribe/start-transcription-job.html), or to use the [AWS Textract .NET SDK](https://docs.aws.amazon.com/sdkfornet/v3/apidocs/index.html?page=Textract/MTextractAnalyzeDocumentAnalyzeDocumentRequest.html) directly – Nick Walsh Nov 19 '19 at 21:55
  • Thanks for another reply Nick. I mean to say that the AWS .NET SDK .Textract extension specifically is not fully available yet as you can see on their .NET SDK GitHub under Textract: https://i.stack.imgur.com/H9KE2.png – darego101 Nov 19 '19 at 22:25
  • @darego101 have you tried installing the `awscli` and invoking it using C#'s `System.Diagnostics.Process.Start()`? I posted references to a relevant SO question as well as the `start-transcription-job` command in the previous comment. – Nick Walsh Nov 20 '19 at 18:35