I didn't see anything in the documentation about being able to instruct the OCR parser to return only Latin-1 characters (Latin-1 encodes just the first 256 code points of the Unicode character set). For example, the OCR interpreted a " double quotation mark as ”, which looks an awful lot like a double quotation mark but is Unicode character \u201d.
Limiting the character set could be a good way to improve OCR accuracy (assuming a document is expected to be in a certain language) and make downstream text processing more predictable. Is this possible?
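
For context, this is roughly the post-processing I would otherwise have to do myself. It is only a sketch of a workaround, not a feature of the parser: the helper names (`to_latin1`, `non_latin1_chars`), the punctuation mapping, and the `?` placeholder are my own assumptions, and it uses nothing beyond the Python standard library.

```python
# Hypothetical post-processing helpers; not part of any OCR library.
# Maps common typographic punctuation to ASCII look-alikes.
PUNCTUATION_MAP = str.maketrans({
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
})


def to_latin1(text: str, placeholder: str = "?") -> str:
    """Map known punctuation, then replace anything still outside Latin-1."""
    mapped = text.translate(PUNCTUATION_MAP)
    # Latin-1 covers code points 0 through 255.
    return "".join(ch if ord(ch) < 256 else placeholder for ch in mapped)


def non_latin1_chars(text: str) -> list[str]:
    """Return the distinct characters in text that fall outside Latin-1."""
    return sorted({ch for ch in text if ord(ch) > 255})
```

Running `non_latin1_chars` over the raw OCR output first shows which characters are slipping through, so the mapping can be extended before anything gets replaced by the placeholder. This only cleans up after the fact, though; it does nothing for recognition accuracy, which is why a parser-level option would be preferable.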