I have looked at existing questions, i.e.:

but have failed when attempting to implement the suggested answers, getting incorrect or unexpected responses.

My recurring issue (which seems to be a common one) is that Tesseract reads the capital letter I as a pipe character (`|`) in every image I scan.

I will almost never have a pipe character in what I'm reading -- virtually 100% of those have been letter Is.

I have tried the `tessedit_char_blacklist` variable to exclude both pipes and exclamation marks. If I blacklist only the pipe, Tesseract outputs an exclamation mark instead. If I blacklist both, the character is dropped from the output entirely.
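For reference, the blacklist attempt described above corresponds roughly to the invocation sketched below. This only illustrates what was tried, not a fix. The file name `sample.png` and the helper `build_tesseract_command` are hypothetical; `-c VAR=VALUE` and the `stdout` output base are standard Tesseract CLI usage.

```python
import shlex


def build_tesseract_command(image_path, blacklist="|!"):
    """Build an argv list that runs Tesseract with a character blacklist.

    The list can be passed to subprocess.run() on a machine where the
    tesseract binary is on PATH.
    """
    return [
        "tesseract",
        image_path,   # input image (hypothetical file name in the demo below)
        "stdout",     # write recognized text to standard output
        "-c",
        f"tessedit_char_blacklist={blacklist}",
    ]


cmd = build_tesseract_command("sample.png")
# Print a shell-quoted form of the command for copy and paste.
print(shlex.join(cmd))
```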

I am on Tesseract v5.0.1.20220118 on Windows 10.

Any help would be appreciated; I imagine I can't be the only person who has this issue.

  • You suggest the solution yourself ... if 100% of the pipe characters `|` are actually `I`s, then fix the problem by replacing them with `I`s in the obtained text. No action is needed on the Tesseract side. If you don't want to solve it this way, provide the images you OCR. Without the images it's hard to guess what actually causes your problem (bad size/resolution/contrast of the images? Does resizing help?). – Claudio Aug 22 '22 at 01:34
  • Okay. That could conceivably work for me. But I can open up a brand-new image, type in Arial 44 "I think that I should frankly I do this I do that" and a fresh install of Tesseract - whether it's on Unix or Windows - will parse every one of those as a pipe. I've seen it happen in Unix and Mac too. This isn't a problem everyone experiences? – MHarris1319 Aug 22 '22 at 06:49
  • To my knowledge Tesseract is not designed to work on screenshots: it expects an image of printed text scanned at 300 dpi, not antialiased screen-font characters. That is why appropriately enlarging the screenshot can help to improve recognition accuracy. – Claudio Aug 22 '22 at 09:21
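The post-processing workaround suggested in the first comment could be sketched as follows. It assumes, as the question states, that the scanned text never contains a genuine pipe character; the function name `pipes_to_capital_i` is made up for illustration, and the input string stands in for whatever text Tesseract returned.

```python
def pipes_to_capital_i(ocr_text):
    """Replace every pipe character in OCR output with a capital I.

    Only safe when the source material is known to contain no real pipes.
    """
    return ocr_text.replace("|", "I")


# Example input imitating the misrecognition described in the question.
raw = "| think that | should frankly | do this | do that"
print(pipes_to_capital_i(raw))
```

Exclamation marks are deliberately left alone here, since they occur legitimately in ordinary text.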
