I am relying heavily on OCR for a project I have been working on; however, with my limited understanding of the field, I am not sure how to proceed.
I have a set of PDF documents that need to be converted to text. When I run them through pytesseract, the output quality is poor: most words come out misspelled. After some searching and reading, I tried preprocessing. After binarizing the pages, I skeletonized them, but that decreased accuracy even further. The next thing I plan to experiment with is a thinning algorithm, looking for a sweet spot between the original stroke thickness and the full skeleton that gives the best output. Is there a better way to improve the accuracy? A simplified sketch of my current pipeline is below.
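This is roughly what I am doing right now (a minimal sketch; the file path, DPI, and thresholds are placeholders, and I'm using skimage's skeletonize for the skeletonization step):

```python
import cv2
import numpy as np
import pytesseract
from pdf2image import convert_from_path
from skimage.morphology import skeletonize

# Render each PDF page to an image (placeholder path and DPI)
pages = convert_from_path("sample.pdf", dpi=300)

for page in pages:
    gray = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)

    # Binarize with Otsu, inverted so the text is white on black for skeletonize
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Skeletonize -- this is the step that made results worse for me
    skeleton = skeletonize(binary > 0)

    # Convert back to black text on a white background before passing to Tesseract
    skeleton_img = np.where(skeleton, 0, 255).astype(np.uint8)

    text = pytesseract.image_to_string(skeleton_img)
    print(text)
```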
P.S. The PDF files vary a lot in quality; some of them produce almost no errors even without any preprocessing.