
I am training the GCP Document AI custom processor for my project. It seems the processor does not recognize Japanese text at all. Is there an option to enable Japanese language support?

anonaka

2 Answers


Currently, the Custom Document Extractor does not support the Japanese language (`ja`).

If you want Japanese language support for the Custom Document Extractor to be implemented, you can open a new feature request on the issue tracker describing your requirement.

For more information regarding custom processors, you can refer to this documentation.

Prajna Rai T

The answer above is accurate. The Custom Document Extractor currently doesn't support Japanese, but it is on the product roadmap for H1 2023. There is a workaround that could work for you until the feature is implemented.

Note: This is not intended to be a permanent solution, but it can increase language capabilities for Document AI Workbench for the time being.

  1. Pre-process your documents for training with the Document OCR processor which supports Japanese.
  2. Save the output ProcessResponse JSON files, then remove the HumanReviewStatus and unwrap the Document object.
    • (i.e. the JSON should start with `"uri": ""`).
  3. Import the Document JSON files you have created into a Document AI Workbench Dataset and label the documents.
    • Note: Schema Labels can only be defined in English.
  4. During prediction, pre-process your documents with the Document OCR processor, then send the output to the Custom Document Extractor for prediction.
    • Note: This only works for online processing, not batch processing.
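
Step 2 above can be sketched as a small script. This is a sketch, not an official sample: the function name `unwrap_process_response` is hypothetical, while the field names `humanReviewStatus` and `document` follow the JSON representation of a saved `ProcessResponse`.

```python
import json

def unwrap_process_response(response_path: str, output_path: str) -> None:
    """Convert a saved ProcessResponse JSON file into a bare Document JSON file.

    Removes the humanReviewStatus field and unwraps the nested "document"
    object, so the output file starts with the Document's own fields
    (e.g. "uri", "mimeType", "text") as required for Workbench import.
    """
    with open(response_path, "r", encoding="utf-8") as f:
        response = json.load(f)

    # Drop review metadata; it is not part of the Document schema.
    response.pop("humanReviewStatus", None)

    # Unwrap the Document object nested under "document".
    document = response.get("document", response)

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(document, f, ensure_ascii=False)
```

The resulting files are what you would then import into a Document AI Workbench Dataset in step 3.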
Holt Skinner
  • We have tried your steps (1 to 3) and it works very well. But in step 4, when we tried to upload a JSON file (using code) to test how well the processor performs, it says: Unsupported input file format. In step 4 you said: `Note: This only works for online processing, not batch processing`. Did you mean that this solution can only be applied in the training phase, and cannot be used in the evaluation/testing and final usage phases? – code đờ Mar 30 '23 at 02:56
    You don't upload a JSON file directly for the second phase of processing. You can use it in the final usage phase, but you have to use an online Processing Request. You will use the `inlineDocument` field in the API request to provide the `Document` object output from the OCR processor as input to the Custom Document Extractor processor. This is why it works only for Online Processing, because you can't specify an `inlineDocument` or a JSON input file with Batch processing. https://cloud.google.com/document-ai/docs/send-request#online-processor – Holt Skinner Mar 30 '23 at 14:51
  • Thank you for your reply, it's extremely helpful to us. We used the output of the OCR processor as input to the Custom Document Extractor as you described above, and it worked amazingly. But we still have issues recognizing checkboxes in the scanned PDF; do you have any suggestions for us? We thought about using Form Parser, but it's costly, and we couldn't think of anything else. The checkbox sample is here: https://imgur.com/OJgUHuz – code đờ Apr 07 '23 at 06:34
  • Please help me with the comment above. Thank you in advance! – code đờ Apr 10 '23 at 07:33
  • You should be able to create a Checkbox data type for the custom document extractor. See here https://cloud.google.com/document-ai/docs/workbench/create-dataset#choose_label_attributes If you're already creating this datatype and it's not working, it's possible that this workaround doesn't work well with checkboxes. Depending on how urgent this is, it might make sense to wait for the expanded language support to be added to Custom Document Extractor. – Holt Skinner Apr 10 '23 at 14:46
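
The chained online request described in the comments above can be sketched as follows. This is a sketch rather than an official sample: `build_chained_request_body` is a hypothetical helper, while the `inlineDocument` field name comes from the `projects.locations.processors:process` REST API linked in Holt's comment.

```python
import json

def build_chained_request_body(ocr_document: dict) -> str:
    """Build the JSON body for an online :process call that feeds the
    Document produced by the Document OCR processor into the Custom
    Document Extractor via inlineDocument.

    Batch processing has no equivalent of inlineDocument (it only reads
    input files from Cloud Storage), which is why this workaround is
    online-only.
    """
    return json.dumps({"inlineDocument": ocr_document}, ensure_ascii=False)

# Usage: POST the returned body to
#   https://{location}-documentai.googleapis.com/v1/{extractor_name}:process
# where {extractor_name} is the Custom Document Extractor's full resource name.
```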