Limit Google Cloud Vision's Character set

Asked Jul 16 '20 at 18:23

Active Jul 24 '20 at 15:31

Viewed 366 times

I didn't see anything in the documentation about instructing the OCR parser to return only Latin-1 characters (Latin-1 encodes just the first 256 code points of Unicode). For example, the OCR interpreted a straight double quotation mark " (U+0022) as ” which looks an awful lot like a straight double quotation mark but is the Unicode character \u201d (right double quotation mark).

Limiting the charset could be a good way to improve OCR accuracy (assuming a document is expected to be in a certain language) and to make downstream text processing more predictable. Is this possible?
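
For context, here is a minimal sketch of the post-processing fallback that such an option would replace. It uses only the Python standard library and has nothing to do with the Vision API itself; the `PUNCT_MAP` table and the `to_latin1` name are illustrative assumptions, and the mapping is hand-picked rather than exhaustive.

```python
import unicodedata

# Hypothetical, hand-picked mapping of common "smart" punctuation to
# Latin-1-safe equivalents; extend it for the documents you expect.
PUNCT_MAP = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
})


def to_latin1(text):
    """Coerce OCR output into the Latin-1 repertoire.

    Anything that still falls outside Latin-1 after normalization and
    mapping is replaced with "?" by the codec's "replace" error handler.
    """
    # NFC composes sequences such as "e" + combining acute into "é".
    text = unicodedata.normalize("NFC", text)
    text = text.translate(PUNCT_MAP)
    return text.encode("latin-1", errors="replace").decode("latin-1")
```

This cleans up the output after the fact, but it cannot recover characters the OCR misread in the first place, which is why a character-set restriction at recognition time would be preferable.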

edited Jul 24 '20 at 15:31

zelusp

You could try language_hints if you know the language: https://cloud.google.com/vision/docs/reference/rpc/google.cloud.vision.v1#imagecontext – Brendan Aug 13 '20 at 05:27
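
A rough sketch of what the comment suggests, assuming the `google-cloud-vision` Python client (version 2 or later) is installed and credentials are configured. The helper names `build_image_context` and `ocr_with_hints` are made up for illustration. As far as I understand, `language_hints` only biases language detection and is not documented as a hard character-set restriction, so characters such as \u201d may still come back.

```python
def build_image_context(language_hints):
    """Build the image_context payload carrying the language hints."""
    return {"language_hints": list(language_hints)}


def ocr_with_hints(image_path, language_hints=("en",)):
    """Run document text detection on a local file with language hints."""
    # Imported lazily so the helper above works without the package.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())

    response = client.document_text_detection(
        image=image,
        image_context=build_image_context(language_hints),
    )
    # The API reports per-image failures in the response body.
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text
```

Calling `ocr_with_hints("scan.png", ["en"])` (with a hypothetical file name) would return the recognized text as a single string.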

0 Answers