8

I recently came across PaddleOCR and am wondering how this OCR system compares to Tesseract. Are there any data or benchmarks available?

user123206

5 Answers

6

I have been using both in research for almost a year. In my experience, each has its own ideal use case.

PaddleOCR PROs:

  1. If the text is rotated by an angle that is not a multiple of 90 degrees, PaddleOCR can still detect some of the text correctly, whereas Tesseract cannot, even with OSD (orientation and script detection) enabled.
  2. You can use PaddleOCR's detection results to correct the rotation, whereas Tesseract is likely to return nonsense results on such images.
  3. PaddleOCR works better than Tesseract on RGB/BGR images when you can't binarize your image.
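
To illustrate point 2, here is a minimal sketch of estimating the skew angle from detection boxes. It assumes each box is a four-point polygon ordered top-left, top-right, bottom-right, bottom-left; the function names and the sample boxes are made up for illustration and are not part of PaddleOCR.

```python
import math
from statistics import median


def box_angle(quad):
    """Angle in degrees of the top edge of one box (TL -> TR)."""
    (x0, y0), (x1, y1) = quad[0], quad[1]
    return math.degrees(math.atan2(y1 - y0, x1 - x0))


def estimate_skew(quads):
    """Median top-edge angle over all boxes; the median resists outliers."""
    return median(box_angle(q) for q in quads)


# Hypothetical detection output: two text lines tilted by the same amount.
boxes = [
    [[10, 10], [110, 20], [108, 40], [8, 30]],
    [[10, 50], [110, 60], [108, 80], [8, 70]],
]
skew = estimate_skew(boxes)
print(f"estimated skew: {skew:.2f} degrees")
```

The image can then be rotated by the opposite angle with any imaging library before recognition. Note that image coordinates usually have y pointing down, so check the sign convention of the rotation function you use.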

Tesseract PROs:

  1. PaddleOCR has serious problems detecting spaces, even after tuning the parameters, but they are working on fixing this in the next upgrade. Tesseract doesn't have significant spacing problems.
  2. Tesseract is better at processing scanned documents.
  3. Tesseract's page segmentation modes help a lot with improving the results.
  4. Tesseract's results on binarized images with long text are usually better than PaddleOCR's.
  5. Tesseract is far better at detecting symbols.
  6. Tesseract is faster on CPU.
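
For point 3, a small sketch of selecting a page segmentation mode through pytesseract. The helper names are my own; it assumes pytesseract and Pillow are installed when `ocr_with_psm` is actually called, which is why they are imported lazily.

```python
def tesseract_config(psm=6, oem=None):
    """Build the config string passed to Tesseract (PSM values are 0 to 13)."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    parts = [f"--psm {psm}"]
    if oem is not None:
        parts.append(f"--oem {oem}")
    return " ".join(parts)


def ocr_with_psm(image_path, psm=6, lang="eng"):
    """Run Tesseract on one image with the chosen page segmentation mode."""
    # Imported here so the config helper works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    return pytesseract.image_to_string(image, lang=lang, config=tesseract_config(psm))


print(tesseract_config(6))   # single uniform block of text
print(tesseract_config(11))  # sparse text, no particular order
```

Which mode works best depends on the layout, so it is worth trying a few on a sample of your own documents.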

In short, using Tesseract would be perfect for scanned documents and PaddleOCR for general computer vision projects.

Esraa Abdelmaksoud
4

I used Tesseract for a while, but its accuracy was poor: for example, the number 4 was recognized as A, 1 as ], 8 as &, and so on.

I have now switched to PaddleOCR, which recognizes text very well when used with good detection, classification, and recognition models.

Comparing the text recognition results of the two, PaddleOCR beats Tesseract.
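
If you want to put a number on such a comparison for your own images, one common measure is the character error rate (edit distance divided by the length of the ground truth). This is only a sketch of that metric, not how I produced my comparison; the sample strings are invented.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate of an OCR output against the ground truth."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


truth = "Invoice 4818"
print(cer(truth, "Invoice A&1&"))  # output of a hypothetical engine 1
print(cer(truth, "Invoice 4818"))  # output of a hypothetical engine 2
```

Run both engines on the same labeled images and average the CER per engine.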

However, PaddleOCR still has some problems: sometimes spaces are missing, and some words or numbers are not recognized well even when the image quality is very good.

I did some research on how to solve this and see 6 possible solutions:

1. Postprocess the output of paddleOCR:

Post-process the output according to the different types of information you are handling. This was my default solution before I researched how to improve PaddleOCR's results. But it has reached its limit: the changes keep piling up and will become unmanageable.

I have also used tabula to extract badly recognized text, applying it only to the regions where recognition failed.
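
A rough sketch of what such type-specific post-processing can look like for numeric fields. The confusion table is hypothetical; build your own from the errors you actually see.

```python
import re

# Hypothetical letter-to-digit confusions; adjust to your own error log.
CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})


def fix_numeric_field(text):
    """Replace look-alike letters with digits, but only in tokens
    where at least half of the characters are already digits."""
    def repair(match):
        token = match.group(0)
        digits = sum(ch.isdigit() for ch in token)
        return token.translate(CONFUSIONS) if digits * 2 >= len(token) else token

    return re.sub(r"\S+", repair, text)


print(fix_numeric_field("Total 1O5l items"))
```

Every new field type needs its own rule like this, which is exactly why this approach gets harder to maintain over time.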

2. Use spelling correction:

You can use a spelling correction library such as pyspellchecker or autocorrect to fix spelling mistakes in the recognized text.
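
To show the idea without extra dependencies, the sketch below uses the standard library's difflib against a small domain vocabulary instead of pyspellchecker or autocorrect. The vocabulary and the cutoff are assumptions you would tune.

```python
import difflib

# Hypothetical domain vocabulary; in practice load it from your own data.
VOCAB = ["invoice", "total", "amount", "customer", "date"]


def correct_word(word, vocab=VOCAB, cutoff=0.8):
    """Return the closest vocabulary word, or the word unchanged if none is close.
    Corrected words come back in the vocabulary's (lowercase) form."""
    if word.lower() in vocab:
        return word
    match = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return match[0] if match else word


def correct_text(text):
    """Correct a whitespace-separated string word by word."""
    return " ".join(correct_word(w) for w in text.split())


print(correct_text("invoce for custmer xyz"))
```

A domain vocabulary tends to be safer than a general dictionary for OCR output, because names and codes that are not real words are left alone.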

3. Train your PaddleOCR models further on your own datasets: This is what I am currently doing. I am training PaddleOCR on my own datasets and use labelimg for annotation. I have also developed a script that autogenerates annotations for labelimg, which I then check quickly to correct recognition errors. This technique reduces the time I spend preparing these datasets.
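
My script is not shown here, but the general idea can be sketched as follows, assuming the annotations are stored as Pascal VOC XML (one of the formats labelimg can read) and that the OCR boxes have already been reduced to axis-aligned rectangles. The file name and the box are placeholders.

```python
import xml.etree.ElementTree as ET


def voc_annotation(filename, width, height, boxes):
    """Build a Pascal VOC XML string.
    boxes: iterable of (label, xmin, ymin, xmax, ymax)."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    for tag, value in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(value)
    for label, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        ET.SubElement(obj, "difficult").text = "0"
        bnd = ET.SubElement(obj, "bndbox")
        for tag, value in (("xmin", xmin), ("ymin", ymin), ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(bnd, tag).text = str(int(value))
    return ET.tostring(root, encoding="unicode")


# Placeholder values standing in for one pre-recognized text region.
xml_text = voc_annotation("page_001.png", 800, 600, [("text", 10, 20, 200, 50)])
print(xml_text)
```

Writing one such file per image gives you pre-filled annotations that only need to be reviewed and corrected by hand.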

4. Use a language model: You can use a language model like GPT-3 or BERT to postprocess the recognized text. These models are used for natural language understanding and question answering, and you can train them on your text. This will be my next step, I will use

5. Use a post-processing pipeline: You can create a custom post-processing pipeline that combines spelling correction libraries and language representation models.
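
A minimal sketch of how such a pipeline can be wired together. The two steps here are simple stand-ins; a spelling corrector or a language-model call would be plugged in as further steps with the same text-in, text-out shape.

```python
import re
from functools import reduce


def normalize_whitespace(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def fix_punct_spacing(text):
    """Remove stray spaces before punctuation marks."""
    return re.sub(r"\s+([,.;:!?])", r"\1", text)


def build_pipeline(*steps):
    """Chain text -> text functions; each step receives the previous output."""
    def run(text):
        return reduce(lambda acc, step: step(acc), steps, text)
    return run


clean = build_pipeline(normalize_whitespace, fix_punct_spacing)
print(clean("  Total :  42 ,  paid "))
```

Keeping each step as a separate function makes it easier to test steps one by one and to reorder them.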

6. Change the OCR engine: Explore other OCR engines. For the moment I don't plan to move away from PaddleOCR, as its recognition level is good, but anything is possible.

elhadi dp ıpɐɥןǝ
2

I found a comparison between PaddleOCR 2 and Tesseract 4, but only for English texts. Briefly summarized:

  1. PaddleOCR is slightly slower than Tesseract on CPUs, but with GPU support it beats Tesseract by 46% on a standard GPU.
  2. Without post-processing, PaddleOCR's mistakes are mainly missing whitespace between words and after punctuation symbols. However, these errors can be easily corrected. After post-processing, the accuracy is comparable to Tesseract's (1% less).
  3. The pre-trained model for English is only about 10% of the file size of Tesseract's English trained data (2MB vs 23MB).
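
The comparison does not spell out its post-processing, so the following is only my guess at what a simple whitespace repair could look like: a regex for spaces after punctuation, plus a dictionary-based word break for fused words. The vocabulary is a toy example.

```python
import re


def space_after_punctuation(text):
    """Insert a space after , ; : ! ? . when a letter follows directly."""
    return re.sub(r"([,;:!?.])(?=[A-Za-z])", r"\1 ", text)


def split_fused(token, vocab):
    """Split a token into the fewest vocabulary words (dynamic programming).
    Returns [token] unchanged if it cannot be split completely."""
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece.lower() in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] or [token]


vocab = {"the", "quick", "fox"}  # toy vocabulary for illustration
print(space_after_punctuation("Hello,world.Next"))
print(" ".join(split_fused("thequickfox", vocab)))
```

Be aware that the punctuation rule will also split abbreviations and URLs, so you may need to exclude those cases for your documents.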

For Chinese texts, which seem to be the main priority of PaddleOCR at the moment, the situation could be different.

2

PaddleOCR recently released v3, and the English space problem has been significantly improved. I tried the English model, and it works very well.

In document scenarios, PaddleOCR can achieve 95%+ accuracy. But Tesseract may be confused on some rhythmic characters.

In particular, PaddleOCR's performance in some non-Latin languages is beyond what I imagined. For Arabic, for example, the results are far better than those of EasyOCR and Tesseract.

Highly recommend PaddleOCR!!!

xiaoting
1

I tested both on English and Japanese and, interestingly, PaddleOCR seems to recognize both languages better than Tesseract. PaddleOCR's text detection also seems better. However, according to user posts, PaddleOCR does not handle spaces very well, and there are complaints from users of languages other than Chinese (or Japanese). PaddleOCR is very eager to incorporate the latest recognition and detection algorithms published in research papers, which is why I decided to use it.

peko