I'm using Google's Document AI to detect text in an image file. The general output is pretty good, but when the image mixes Chinese and English text, some of the Chinese characters are not detected and are omitted. I was wondering whether this is related to the language_hints setting (when most of the text is Chinese, the result is much better than when it is mixed with English). I see a mention here of setting the processOptions field, but I can't find where in the code to set it.
I tried the following in Python:
process_options = {
    "ocrConfig": {
        "hints": {
            "languageHints": ["zh"]
        }
    }
}
request = documentai_v1beta3.ProcessRequest(
    name=name,
    raw_document=raw_document,
    process_options=process_options,
)
It gives an error:
ValueError: Protocol message ProcessOptions has no "ocrConfig" field.
Update:
I changed the keys in process_options from camelCase to snake_case as follows, and now the ValueError is gone:
process_options = {
    "ocr_config": {
        "hints": {
            "language_hints": ["zh"]
        }
    }
}
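The likely reason the second version works is that the REST reference spells field names in camelCase (ocrConfig, languageHints), while the Python client expects the snake_case names from the protobuf definition (ocr_config, language_hints). As a minimal sketch, assuming you are copying a JSON fragment from the REST docs, the keys could be converted mechanically. The helper names camel_to_snake and snake_case_keys are hypothetical and not part of the Document AI client, and the conversion is naive (it does not handle acronyms specially):

```python
import re


def camel_to_snake(name):
    # Put an underscore before each capital that follows a lowercase
    # letter or digit, then lowercase the whole name.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def snake_case_keys(value):
    # Recursively rename dict keys; list items and scalar values are kept as-is.
    if isinstance(value, dict):
        return {camel_to_snake(k): snake_case_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [snake_case_keys(v) for v in value]
    return value


# Fragment as it appears in the REST-style (camelCase) form.
rest_style = {"ocrConfig": {"hints": {"languageHints": ["zh"]}}}

process_options = snake_case_keys(rest_style)
print(process_options)
# {'ocr_config': {'hints': {'language_hints': ['zh']}}}
```

The resulting dict has the same shape as the one that made the ValueError go away, so it can be passed as process_options in the same way.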
Before setting the language hints, the OCR result:
villain /'vilən/ n 1 EVIL CHARACTER an evil character in a novel, film, play, or other story, especially one who is the main enemy of the hero(小说、电影、戏剧等 H)EÍAŁ‚E 2 CONTEMPTIBLE PERSON any per- son regarded as evil or otherwise contemptible, 3 CAUSE OF PROBLEM a person who or thing that causes a specific evil or problem TOYOTA 4 MISCHIEVOUS PERSON a mischievous or troublesome person () 5 CRIMINAL a criminal (1) 罪犯6【史】 = villein [14 世纪。经由古法语 vilein 'feudal serf' < +Ti villanus 'farmhand' < ti Tvilla 'country home, farm']
After setting the language hints, the OCR result:
villain /'vilən/ n 1 EVIL CHARACTER an evil character in a novel, film, play, or other story, especially one who is the main enemy of the hero(小说、电影、戏剧等 的)反面角色,反面人物 2 CONTEMPTIBLE PERSON any per- son regarded as evil or otherwise contemptible, 3 CAUSE OF PROBLEM a person who or thing that causes a specific evil or problem TOYOTA 4 MISCHIEVOUS PERSON a mischievous or troublesome person〈诙〉淘气鬼,捣蛋鬼 5 CRIMINAL a criminal〈俚〉 罪犯6【史】 = villein[14 世纪。经由古法语 vilein ‘feudal serf’<中世纪拉丁语 villanus 'farmhand' <拉 Tvilla 'country home, farm']
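To compare the two results less by eye, one option is to pull the runs of Chinese characters out of each OCR string and list the ones that appear only after the hints were set. This is only a sketch: the names cjk_runs and new_runs are made up for illustration, the regular expression covers just the basic CJK Unified Ideographs block, and the two strings are short excerpts of the outputs above.

```python
import re

# Basic CJK Unified Ideographs block only; punctuation such as 、 and 【 】
# and the CJK extension blocks are deliberately left out of this sketch.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def cjk_runs(text):
    # Return every maximal run of Chinese characters, in order of appearance.
    return CJK_RUN.findall(text)


def new_runs(before, after):
    # Runs found in `after` that do not occur anywhere in `before`.
    seen = set(cjk_runs(before))
    return [run for run in cjk_runs(after) if run not in seen]


# Excerpts from the two OCR results (without and with language hints).
before = "hero(小说、电影、戏剧等 H)EÍAŁ‚E 2 CONTEMPTIBLE PERSON"
after = "hero(小说、电影、戏剧等 的)反面角色,反面人物 2 CONTEMPTIBLE PERSON"

print(new_runs(before, after))
# ['的', '反面角色', '反面人物']
```

Running the same comparison against a hand-typed transcription of the page would show which translations are still missing.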
The result is definitely better, but the 2nd and 3rd Chinese translations are still not recognized, as the red rectangles below indicate. Does anyone know why it sometimes fails to recognize some of the Chinese characters?
P.S. If I crop just the 2nd and 3rd Chinese translations into individual images, they are recognized perfectly.
Recognized result:
罪魁祸首,问题的起因