
I was using the Document OCR API to extract text from a PDF file, but part of the output is not accurate. I found that the problem may be due to the presence of some Chinese characters.

The following is a made-up example in which I cropped part of the region where the extracted text is wrong and added some Chinese characters to reproduce the problem.

Input file

When I use the website version, I cannot get the Chinese characters, but the remaining characters are correct.

Result from website version OCR

When I use Python to extract the text, I can get the Chinese characters correctly, but some of the remaining characters are wrong.

Result from program

The actual string that I got.

Actual result

Are the versions of Document AI in the website and API different? How can I get all the characters correctly?


Update:

When I print detected_languages after printing the text, I get the following output. (I don't know why, but with lines = page.lines the detected_languages of both lines are empty lists; I need to change to page.blocks or page.paragraphs first.)

language code

Code:

from google.cloud import documentai_v1beta3 as documentai

project_id= 'secret-medium-xxxxxx'
location = 'us' # Format is 'us' or 'eu'
processor_id = 'abcdefg123456' #  Create processor in Cloud Console

opts = {}
if location == "eu":
    opts = {"api_endpoint": "eu-documentai.googleapis.com"}
client = documentai.DocumentProcessorServiceClient(client_options=opts)

def get_text(doc_element: dict, document: dict):
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        # start_index may be omitted by the API when it is 0
        start_index = int(segment.start_index) if segment.start_index else 0
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response

def get_lines_of_text(file_path: str, location: str = location, processor_id: str = processor_id, project_id: str = project_id):

    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    # opts = {}
    # if location == "eu":
    #     opts = {"api_endpoint": "eu-documentai.googleapis.com"}

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processor/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "raw_document": document}

    result = client.process_document(request=request)
    document = result.document

    document_pages = document.pages

    response_text = []
    # For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document

    # Read the text recognition output from the processor
    print("The document contains the following paragraphs:")
    for page in document_pages:
        lines = page.blocks
        for line in lines:
            block_text = get_text(line.layout, document)
            confidence = line.layout.confidence
            response_text.append((block_text[:-1] if block_text[-1:] == '\n' else block_text, confidence))
            print(f"Text: {block_text}")
            print("Detected Language", line.detected_languages)
    return response_text

if __name__ == '__main__':
    print(get_lines_of_text('/pdf path'))

It seems the language code is wrong; will this affect the result?

asked by iter07
  • You should embed the images inside the question itself to make it a complete question. External links get destroyed after some time. – kkgarg Aug 14 '21 at 19:18
  • Could you provide more details about your scenario, because you can use [Document AI OCR](https://cloud.google.com/document-ai/docs/processors-list#processor_doc-ocr) and [Vision OCR](https://cloud.google.com/vision/docs/pdf) to get text from a PDF. How many PDF files do you want to use, and how many pages do those PDFs have? Can you share your Python code and all your steps? – PjoterS Aug 16 '21 at 10:38
  • @PjoterS I just use the code [here](https://cloud.google.com/document-ai/docs/libraries) to get the text. Other details wouldn't help in improving the accuracy of the OCR. – iter07 Aug 17 '21 at 01:22
  • And I changed `paragraphs = page.paragraphs` to `lines = page.lines` – iter07 Aug 17 '21 at 01:34
  • Is it possible to provide your full code? – PjoterS Aug 17 '21 at 16:04
  • @PjoterS I think the code should be enough, other parts are irrelevant. – iter07 Aug 18 '21 at 01:24
  • Thanks for your code. I also got different outputs from your code and the demo; however, both are using `v1beta3`, which is quite strange. It might be related to different endpoints, language alphabet recognition or some random stuff. Is there any reason why you are using DAI OCR? Did you try to use `Vision API` with `DOCUMENT_TEXT_DETECTION` or `TEXT_DETECTION` as mentioned in [Detect text in files (PDF/TIFF)](https://cloud.google.com/vision/docs/pdf)? (A rough sketch of such a call follows these comments.) If you must use `DAI OCR` you could create a report using the [Issue Tracker](https://issuetracker.google.com/) for Google engineers to verify it. – PjoterS Aug 18 '21 at 13:09
  • @PjoterS I was originally searching for a document splitter; when creating the processor, I saw that Document AI also provides OCR, and it is much more accurate than Tesseract. I didn't know about the Vision API before. – iter07 Aug 19 '21 at 01:35
  • @PjoterS I just tried the Vision API demo and the output is the same as my program's output, so I think I don't need to change to the Vision API. – iter07 Aug 19 '21 at 01:44
  • I still have the exact same issue on the V1 API. The Try It gives better results than the Python client API. – Shivam Miglani Feb 14 '22 at 23:17
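
For reference, below is a minimal sketch of the Vision API approach suggested in the comments. It assumes the google-cloud-vision client (v2+), a small PDF that can be sent inline with a synchronous request, and placeholder values for the file path and language hints (the hints are optional and only illustrate the mixed Chinese/English sample):

from google.cloud import vision

# Synchronous file annotation handles small PDFs sent inline.
client = vision.ImageAnnotatorClient()

with open("sample.pdf", "rb") as f:
    pdf_content = f.read()

request = vision.AnnotateFileRequest(
    input_config=vision.InputConfig(content=pdf_content, mime_type="application/pdf"),
    features=[vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)],
    # Optional hints for the mixed Chinese/English example.
    image_context=vision.ImageContext(language_hints=["zh", "en"]),
)

response = client.batch_annotate_files(requests=[request])
for file_response in response.responses:
    for page_response in file_response.responses:
        print(page_response.full_text_annotation.text)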

1 Answer


Posting this Community Wiki for better visibility.

One of the features of Document AI is OCR (Optical Character Recognition), which allows recognizing text from various files.

In this scenario, the OP received different outputs using the Try it function and the Python client library.

Why are there discrepancies between Try it and the Python library? It's hard to say, as both methods use the same documentai_v1beta3 API. It might be related to some file modifications when the PDF is uploaded to the Try it demo, different endpoints, language alphabet recognition, or some other factor.

When you use the Python client you also get a confidence percentage for the text identification. Below are examples from my tests: <pic of my % identification>

However, the OP's confidence is about 0.73, so it might produce wrong results, and in this situation there is a visible issue. I guess it cannot be improved in code. It might help if the PDF were of different quality (in the OP's example there are some dots which might affect identification).
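
As an illustration of using that confidence value, here is a small sketch that reuses the (text, confidence) tuples returned by the OP's get_lines_of_text to flag blocks that probably need manual review (the 0.9 threshold is an arbitrary example):

# Flag low-confidence blocks using the (text, confidence) tuples
# returned by get_lines_of_text above.
CONFIDENCE_THRESHOLD = 0.9  # arbitrary example value

for text, confidence in get_lines_of_text('/pdf path'):
    if confidence < CONFIDENCE_THRESHOLD:
        print(f"LOW CONFIDENCE ({confidence:.2f}): {text}")
    else:
        print(f"OK ({confidence:.2f}): {text}")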

answered by PjoterS
  • Could you explain what a Community Wiki is? I don't know what that is... – iter07 Aug 23 '21 at 01:31
  • Hi @Wytrzymały Wiktor, thanks for your welcome. I'm just looking for a solution that can improve the accuracy. I have already posted an issue to Google but haven't gotten any reply. – iter07 Aug 23 '21 at 01:37
  • A community wiki is a post that can be maintained by the community with less work and does not provide reputation gains to the author. In short, it's used when there is no solution for an issue but the post provides some possibilities for the root cause, or information which could help other community members with a similar issue. It can be modified by other users, so if something is fixed in the future it might be updated. More details can be found [here](https://meta.stackexchange.com/questions/11740/what-are-community-wiki-posts). – PjoterS Aug 23 '21 at 11:06
  • @PjoterS Okay, I got it now. Thanks for your community wiki. – iter07 Aug 24 '21 at 01:16
  • @PjoterS Btw, do you know how to remove the dots or increase the quality of the image? – iter07 Aug 24 '21 at 09:13
  • How did you get this file? Was it scanned or sent from a 3rd-party company? You could divide this PDF into separate images and try Vision OCR, as depending on the zoom it might provide different results (a rough preprocessing sketch follows). – PjoterS Aug 25 '21 at 10:31
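
A minimal sketch of that idea, assuming the pdf2image and OpenCV packages (pdf2image also needs Poppler installed); the file name, DPI and median-blur kernel size are placeholder values for rendering each page and removing small isolated dots before sending the images to an OCR engine:

import cv2
import numpy as np
from pdf2image import convert_from_path

# Render each PDF page to an image (300 DPI is an example value).
pages = convert_from_path("input.pdf", dpi=300)

for i, page in enumerate(pages):
    # Convert the PIL image to an OpenCV array and then to grayscale.
    img = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)
    # Median blur removes small isolated dots (salt-and-pepper noise).
    denoised = cv2.medianBlur(img, 3)
    cv2.imwrite(f"page_{i}.png", denoised)
    # Each page_<i>.png can then be sent to Vision OCR or Document AI.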