Can text structure be retained using Google Cloud Vision TEXT_DETECTION?

Question

Version 1 of the Google Cloud Vision API (beta) permits optical character recognition via TEXT_DETECTION requests. While recognition quality is good, characters are returned without any hint of the original layout. Structured text (e.g., tables, receipts, columnar data) is therefore sometimes incorrectly ordered.

Is it possible to preserve document structure with the Google Cloud Vision API? Similar questions have been asked about Tesseract and hOCR, for example [1] and [2]. There is currently no information about TEXT_DETECTION options in the documentation [3].

[1] How to preserve document structure in tesseract
[2] Tesseract - ambiguity in space and tab
[3] https://cloud.google.com/vision/

As far as I can tell, each chunk of text recognized by the API comes with coordinates. So, if you know a certain kind of text is likely to be at the top of the image, you can look at the chunks placed near the top; if you need to check a sum from a table of values, for example, you can look at the text recognized in the bottom-right corner of the image. I know it's far from the ideal scenario, and I also thought it was going to be easier before I sent a real sample to the API. But that's all I can think of for now to work around this "problem". – Cotta, Jun 26 '16 at 18:29
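
The coordinate-based idea in the comment above can be sketched roughly as follows. This is only a sketch, assuming each word annotation is a dict shaped like the v1 JSON response (a `description` string plus `boundingPoly.vertices` with `x` and `y` keys) and that the first annotation, which carries the full text, has already been dropped. The names `box` and `group_into_lines` and the half-median-height tolerance are illustrative choices, not part of the API.

```python
from statistics import median


def box(annotation):
    """Return (left, top, right, bottom) for one annotation dict."""
    vertices = annotation["boundingPoly"]["vertices"]
    # Defensive default: treat a missing coordinate as 0.
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    return min(xs), min(ys), max(xs), max(ys)


def group_into_lines(annotations):
    """Cluster word annotations into rows by vertical position,
    then order the words of each row from left to right."""
    words = [(box(a), a["description"]) for a in annotations]
    if not words:
        return []
    # Assumption: words on the same row have vertical centres that
    # differ by at most half the median word height.
    tolerance = median(b[3] - b[1] for b, _ in words) / 2
    words.sort(key=lambda w: (w[0][1] + w[0][3]) / 2)
    rows = []
    for b, text in words:
        centre = (b[1] + b[3]) / 2
        if rows and abs(centre - rows[-1]["centre"]) <= tolerance:
            rows[-1]["words"].append((b[0], text))
        else:
            rows.append({"centre": centre, "words": [(b[0], text)]})
    return [" ".join(t for _, t in sorted(r["words"])) for r in rows]
```

This only recovers simple rows. Skewed scans, multi-column layouts, or rows with very different font sizes would likely need a more careful clustering step.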

score 2 · Answer 1 · answered Feb 20 '16 at 16:33

Recognizing text structure is a more abstract task than recognizing the text itself (letters, words, sentences). If you already have this structural information in your file metadata, you could do something like this:

Segment your input image into subparts.
Execute a TEXT_DETECTION request for each subpart.
Re-order the recognized text based on your metadata.
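
The three steps above might look roughly like this in code. It is a sketch under stated assumptions: the `Region` type, its `order` field, and the `crop` and `ocr` callables are all hypothetical placeholders. `crop` stands for whatever image library cuts out a sub-image, and `ocr` stands for a function that sends one sub-image in a TEXT_DETECTION request and returns the recognized string.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class Region:
    """One known area of the document, taken from your own metadata."""
    name: str
    box: Box
    order: int  # reading position of this region within the document


def read_in_order(
    image: Any,
    regions: Sequence[Region],
    crop: Callable[[Any, Box], Any],
    ocr: Callable[[Any], str],
) -> List[Tuple[str, str]]:
    """Segment the image, OCR each part, and re-order by metadata."""
    # Step 1: cut the image into one sub-image per region.
    parts = [(region, crop(image, region.box)) for region in regions]
    # Step 2: run OCR on every sub-image (one request per region).
    texts = [(region, ocr(part)) for region, part in parts]
    # Step 3: restore the reading order recorded in the metadata.
    texts.sort(key=lambda pair: pair[0].order)
    return [(region.name, text) for region, text in texts]
```

Note that this issues one OCR call per region, so the number of requests grows with the number of regions in the layout.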

I'm not an expert in the Cloud Vision text_detection API, but it is called text_detection, not language_detection or text_structure_detection, which gives a hint about the level at which the detection operates.

Maybe it's a feature they are planning to add in the future or describe in the documentation.

Fees are charged per image, so dividing the image into subparts could be very costly for complicated structures. – user3761401, Feb 20 '16 at 18:25
