
I am relying heavily on OCR for a project I have been working on; however, with my limited understanding of the field, I am not sure how to proceed.

I have a list of PDF documents that need to be converted to text. When I run them through pytesseract, the output is not great in terms of accuracy; most words are misspelled. After some searching and learning, I tried preprocessing: I binarized the documents and then skeletonized them, but this seems to have decreased the accuracy even further. The next thing I plan to experiment with is a thinning algorithm, to find a sweet spot between the original stroke thickness and the fully skeletonized strokes that gives the best output. Is there a better option for improving the accuracy?
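For reference, here is a minimal sketch of the basic pipeline I am describing (render the PDF pages, binarize with Otsu, then OCR), before any skeletonization or thinning. The file name and DPI are placeholders, and it assumes pdf2image, OpenCV, numpy and pytesseract are installed:

```
import cv2
import numpy as np
import pytesseract
from pdf2image import convert_from_path

# Render each PDF page to an image (300 DPI is a placeholder; a higher DPI often helps Tesseract)
pages = convert_from_path("document.pdf", dpi=300)  # "document.pdf" is a placeholder path

text_parts = []
for page in pages:
    # PIL image -> grayscale OpenCV array
    gray = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)

    # Otsu binarization (the step I apply before skeletonizing)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # OCR the binarized page; --psm 3 is Tesseract's default full-page segmentation mode
    text_parts.append(pytesseract.image_to_string(binary, config="--psm 3"))

print("\n".join(text_parts))
```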

P.S. The PDF files vary a lot in quality; some of them come out with almost zero errors when run without any preprocessing.

Chinmay
  • Also have a look at this: https://stackoverflow.com/questions/28935983/preprocessing-image-for-tesseract-ocr-with-opencv – Jeru Luke Jun 29 '22 at 08:11
  • do not "preprocess". tesseract is simply that bad. you could train it on your specific data but in your place I'd just use a different OCR package, one that is closer to state-of-the-art. I've heard of "easyocr", which performs consistently better than tesseract. – Christoph Rackwitz Jun 29 '22 at 08:27
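A minimal sketch of what swapping in easyocr, as the comment above suggests, might look like; it assumes the easyocr package is installed, and the language list and image path are placeholders:

```
import easyocr

# Build a reader once; it downloads the detection/recognition models on first use.
# ['en'] is a placeholder language list.
reader = easyocr.Reader(['en'])

# readtext accepts a file path or a numpy array; detail=0 returns just the recognized strings.
lines = reader.readtext("page.png", detail=0)  # "page.png" is a placeholder image
print("\n".join(lines))
```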

0 Answers