
I am attempting to use the now-supported PDF/TIFF document text detection from the Google Cloud Vision API. Using their example code, I am able to submit a PDF and receive back a JSON object with the extracted text. My issue is that the JSON file saved to GCS only contains bounding boxes and text for "symbols", i.e. each individual character in each word. This makes the JSON object quite unwieldy and very difficult to use. I'd like to be able to get the text and bounding boxes for "LINES", "PARAGRAPHS" and "BLOCKS", but I can't seem to find a way to do it via the AsyncAnnotateFileRequest() method.

The sample code is as follows:

import re

from google.cloud import storage
from google.cloud import vision
from google.protobuf import json_format


def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """OCR with PDF/TIFF as source files on GCS"""
    # Supported mime_types are: 'application/pdf' and 'image/tiff'
    mime_type = 'application/pdf'

    # How many pages should be grouped into each json output file.
    batch_size = 2

    client = vision.ImageAnnotatorClient()

    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)

    gcs_source = vision.types.GcsSource(uri=gcs_source_uri)
    input_config = vision.types.InputConfig(
        gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.types.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.types.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size)

    async_request = vision.types.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config,
        output_config=output_config)

    operation = client.async_batch_annotate_files(
        requests=[async_request])

    print('Waiting for the operation to finish.')
    operation.result(timeout=180)

    # Once the request has completed and the output has been
    # written to GCS, we can list all the output files.
    storage_client = storage.Client()

    match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    bucket = storage_client.get_bucket(bucket_name)

    # List objects with the given prefix.
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print('Output files:')
    for blob in blob_list:
        print(blob.name)

    # Process the first output file from GCS.
    # Since we specified batch_size=2, the first response contains
    # the first two pages of the input file.
    output = blob_list[0]

    json_string = output.download_as_string()
    response = json_format.Parse(
        json_string, vision.types.AnnotateFileResponse())

    # The actual response for the first page of the input file.
    first_page_response = response.responses[0]
    annotation = first_page_response.full_text_annotation

    # Here we print the full text from the first page.
    # The response contains more information:
    # annotation/pages/blocks/paragraphs/words/symbols
    # including confidence scores and bounding boxes
    print(u'Full text:\n{}'.format(
        annotation.text))
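
For completeness, this is how I invoke the function; the bucket and object names below are just placeholders:

# Placeholder URIs -- replace with your own bucket and object names.
async_detect_document(
    'gs://my-input-bucket/my-document.pdf',
    'gs://my-output-bucket/ocr-output/')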
– metersk
  • https://stackoverflow.com/questions/42391009/text-extraction-line-by-line/54380077#54380077 – Gino Jan 26 '19 at 16:06

1 Answer


Unfortunately, when using the DOCUMENT_TEXT_DETECTION type, you can only get the full text per page or the individual symbols. It's not too difficult to put the paragraphs and lines back together from the symbols, though; something like this should work (extending your example):

# Walk the block/paragraph/word/symbol hierarchy and rebuild lines and
# paragraphs from the break type detected after each symbol.
breaks = vision.enums.TextAnnotation.DetectedBreak.BreakType
paragraphs = []
lines = []

for page in annotation.pages:
    for block in page.blocks:
        for paragraph in block.paragraphs:
            para = ""
            line = ""
            for word in paragraph.words:
                for symbol in word.symbols:
                    line += symbol.text
                    if symbol.property.detected_break.type == breaks.SPACE:
                        line += ' '
                    if symbol.property.detected_break.type == breaks.EOL_SURE_SPACE:
                        line += ' '
                        lines.append(line)
                        para += line
                        line = ''
                    if symbol.property.detected_break.type == breaks.LINE_BREAK:
                        lines.append(line)
                        para += line
                        line = ''
            paragraphs.append(para)

print(paragraphs)
print(lines)
– Dustin Ingram
  • This solution does the same thing as annotation.Text property, which is already built in. – marcus Nov 13 '18 at 12:54
  • No, it doesn't: the question was originally using `annotation.text`, but that has exactly the problem they were asking about: it doesn't break up the response into lines and paragraphs. This solution does. – Dustin Ingram Nov 13 '18 at 16:44
  • On my end, I'm getting the same results from `annotation.text` and from your code. Don't get me wrong, I like the break type filtering, which is why I voted this answer, but it doesn't improve my output. – marcus Nov 16 '18 at 14:32
  • Yes, the results will be the same, the question is about the structure of the results. – Dustin Ingram Nov 16 '18 at 17:13
  • One thing I've found about this code is that `symbol.property` doesn't always exist, which triggers an `AttributeError`. So I wrapped the `if symbol.property...` lines in a `try/except AttributeError` block and ignore the error with `pass`. – donarb May 08 '19 at 15:01
  • I understand that; I can get the coordinates of paragraphs. Can I get the coordinates of lines? – Amarnath R Jul 31 '19 at 10:28
  • You can get bounding polys (boxes actually) of words and symbols, but not lines. The vertices are page coordinates, but always arranged top-left, top-right, bottom-right, bottom-left, in the local orientation of the symbol/word/paragraph. In my experience, when you have a mixture of text orientations, the association of words to paragraphs is somewhat random and paragraphs will be visually interleaved and overlapping. Messy to decipher... – Eric Schoen Jan 10 '20 at 18:12
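
Following up on the last two comments: since the API does return `bounding_box` for words and symbols, one rough way to get line coordinates is to merge the word boxes that make up each line. The sketch below is only an approximation: it reuses `annotation` and `breaks` from the answer above, assumes roughly horizontal text (mixed orientations will break it, as noted in the comment above), and may still need the `try/except AttributeError` guard mentioned earlier.

# Approximate per-line bounding boxes by merging the word-level boxes.
# Assumes `annotation` and `breaks` are already defined as in the answer.

def merge_boxes(boxes):
    """Collapse a list of bounding polys into one axis-aligned box."""
    xs = [v.x for box in boxes for v in box.vertices]
    ys = [v.y for box in boxes for v in box.vertices]
    return {'x_min': min(xs), 'y_min': min(ys),
            'x_max': max(xs), 'y_max': max(ys)}

line_boxes = []
for page in annotation.pages:
    for block in page.blocks:
        for paragraph in block.paragraphs:
            current_line = []
            for word in paragraph.words:
                current_line.append(word.bounding_box)
                # The break attached to a word's last symbol marks the end of a line.
                last_break = word.symbols[-1].property.detected_break.type
                if last_break in (breaks.EOL_SURE_SPACE, breaks.LINE_BREAK):
                    line_boxes.append(merge_boxes(current_line))
                    current_line = []
            if current_line:
                # Words left over when the paragraph ends without an explicit break.
                line_boxes.append(merge_boxes(current_line))

print(line_boxes)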