I didn't see anything in the documentation about instructing the OCR parser to return only Latin-1 characters (an encoding that covers just the first 256 code points of Unicode). For example, the OCR interpreted a " double quotation mark as ”, which looks an awful lot like a double quotation mark but is actually the Unicode character \u201d (right double quotation mark).

Limiting the charset could be a good way to improve OCR accuracy (assuming the document is expected to be in a certain language) and to make downstream text processing more predictable. Is this possible?
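
To illustrate what "more predictable" means here, this is a minimal sketch of the kind of post-processing that would otherwise be needed after OCR. It uses only the Python standard library and does not touch the OCR API itself. The `LOOKALIKES` table and the `to_latin1` helper are hypothetical names made up for this example, and the set of look-alike characters is an assumption, not an exhaustive list.

```python
import unicodedata

# Hypothetical mapping of common look-alike punctuation to ASCII equivalents.
LOOKALIKES = {
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
}
_TABLE = str.maketrans(LOOKALIKES)


def to_latin1(text: str, placeholder: str = "?") -> str:
    """Return text containing only code points below 256 (Latin-1)."""
    # Replace known look-alikes first; NFKC alone leaves curly quotes untouched.
    text = text.translate(_TABLE)
    # NFKC folds compatibility forms (e.g. the "fi" ligature) into plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Anything still outside Latin-1 is replaced with the placeholder.
    return "".join(ch if ord(ch) < 256 else placeholder for ch in text)


if __name__ == "__main__":
    print(to_latin1("\u201cHello\u201d"))
```

This only cleans up the text after the fact; it cannot recover characters the OCR misread, which is why restricting the charset at recognition time would be preferable.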

zelusp
    You could try language_hints if you know the language: https://cloud.google.com/vision/docs/reference/rpc/google.cloud.vision.v1#imagecontext – Brendan Aug 13 '20 at 05:27

0 Answers