
I'm using Google's Document AI to detect text in an image file. The general output is pretty good, but if the image contains mixed Chinese and English text, some of the Chinese characters are not detected and are omitted. I was wondering whether this is related to the language_hints setting (if most of the text is Chinese, the result is much better than when it is mixed with English text). I have seen mentions of setting the processOptions field, but I can't find where in the code to set it.

Here is the image: [image]

I tried the following in Python:

    process_options = {
        "ocrConfig": {
            "hints": {
                "languageHints": [
                    "zh"
                ]
            }
        }
    }

    request = documentai_v1beta3.ProcessRequest(name=name, raw_document=raw_document, process_options=process_options)

It gives an error:

ValueError: Protocol message ProcessOptions has no "ocrConfig" field.

Update:

I changed process_options to the following, now the ValueError is gone:

    process_options = {
        "ocr_config": {
            "hints": {
                "language_hints": [
                    "zh"
                ]
            }
        }
    }
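The difference between the two versions is only the key naming: the REST documentation writes the fields in camelCase (ocrConfig, languageHints), while the Python client expects the snake_case field names (ocr_config, language_hints). As a minimal sketch, a REST-style dict could be converted mechanically. The helper names to_snake_case and snake_case_keys are hypothetical and not part of the Document AI client, and the simple regex would split acronyms letter by letter.

```python
import re


def to_snake_case(key: str) -> str:
    """Convert a camelCase key such as 'ocrConfig' to 'ocr_config'."""
    # Insert an underscore before every uppercase letter except at the start.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", key).lower()


def snake_case_keys(value):
    """Recursively rename dict keys; lists and scalars pass through unchanged."""
    if isinstance(value, dict):
        return {to_snake_case(k): snake_case_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [snake_case_keys(v) for v in value]
    return value


# REST-style options as written in the first attempt above.
rest_style = {"ocrConfig": {"hints": {"languageHints": ["zh"]}}}

# Same options with the field names the Python client accepted.
process_options = snake_case_keys(rest_style)
```

The resulting process_options dict has the same shape as the one written out by hand above.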

Before setting the language hints, the OCR result:

villain /'vilən/ n 1 EVIL CHARACTER an evil character in a novel, film, play, or other story, especially one who is the main enemy of the hero(小说、电影、戏剧等 H)EÍAŁ‚E 2 CONTEMPTIBLE PERSON any per- son regarded as evil or otherwise contemptible, 3 CAUSE OF PROBLEM a person who or thing that causes a specific evil or problem TOYOTA 4 MISCHIEVOUS PERSON a mischievous or troublesome person () 5 CRIMINAL a criminal (1) 罪犯6【史】 = villein [14 世纪。经由古法语 vilein 'feudal serf' < +Ti villanus 'farmhand' < ti Tvilla 'country home, farm']

After setting the language hints, the OCR result:

villain /'vilən/ n 1 EVIL CHARACTER an evil character in a novel, film, play, or other story, especially one who is the main enemy of the hero(小说、电影、戏剧等 的)反面角色,反面人物 2 CONTEMPTIBLE PERSON any per- son regarded as evil or otherwise contemptible, 3 CAUSE OF PROBLEM a person who or thing that causes a specific evil or problem TOYOTA 4 MISCHIEVOUS PERSON a mischievous or troublesome person〈诙〉淘气鬼,捣蛋鬼 5 CRIMINAL a criminal〈俚〉 罪犯6【史】 = villein[14 世纪。经由古法语 vilein ‘feudal serf’<中世纪拉丁语 villanus 'farmhand' <拉 Tvilla 'country home, farm']
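One rough way to compare the two outputs is to count how many Chinese characters each contains. This is only a sketch: count_cjk is a hypothetical helper, it only covers the basic CJK Unified Ideographs range (U+4E00 to U+9FFF), and the two strings are short excerpts of the first definition from each result above.

```python
import re

# Basic CJK Unified Ideographs block; punctuation such as "、" is not counted.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def count_cjk(text: str) -> int:
    """Return the number of CJK ideographs found in text."""
    return len(CJK_PATTERN.findall(text))


# Excerpts of the first definition, before and after setting language hints.
before = "the hero(小说、电影、戏剧等 H)EÍAŁ‚E"
after = "the hero(小说、电影、戏剧等 的)反面角色,反面人物"

print("before:", count_cjk(before))
print("after:", count_cjk(after))
```

A higher count does not prove the characters are correct, but a large gap between two runs on the same image is a quick signal that text was dropped.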

The result is definitely better, but the 2nd and 3rd Chinese translations are still not recognized, as the red rectangles below indicate. Does anyone know why it sometimes fails to recognize some of the Chinese characters?

P.S. If I crop the 2nd and 3rd Chinese translations out into individual images, they are recognized perfectly.

[Image: some Chinese characters are not recognized]

[Image: 3rd Chinese translation]

Recognized result:

罪魁祸首,问题的起因

jonah_w

1 Answer


Your second code sample is correct. The sample linked below shows how to add some options to ocr_config; the official samples will be updated to include this field once the options are added to the v1 endpoint.

https://github.com/GoogleCloudPlatform/document-ai-samples/tree/main/pdf-embedded-text

Also, some variance in performance can be expected when processing a document that contains multiple languages in the same line (like the example you are using). It could also be worth trying a different processor version, such as pretrained-ocr-v1.2-2022-11-10, to see if the results differ.
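To try a specific processor version, the name passed in the request would, as far as I know, need to point at the version resource rather than at the processor itself. A minimal sketch of building that resource name follows. The function processor_version_name is a hypothetical helper, and the project, location, and processor IDs are placeholders; the client library also ships its own path helpers, which may be preferable.

```python
def processor_version_name(
    project_id: str, location: str, processor_id: str, version_id: str
) -> str:
    """Build the full resource name of a Document AI processor version."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/processors/{processor_id}/processorVersions/{version_id}"
    )


# Placeholder IDs; replace them with the values from your own project.
name = processor_version_name(
    "my-project", "us", "my-processor-id", "pretrained-ocr-v1.2-2022-11-10"
)
print(name)
```

The resulting string would then be used as the name argument of the ProcessRequest in place of the plain processor name.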

Holt Skinner