Improve Tesseract OCR results by removing special characters

Question

I am running Tesseract on PDF, image, and TIFF files saved in my database, but I get a lot of garbage text in the output for many of them. For example, the image in this case gave me the following text output.

“‘55“ .'Hﬁﬁjﬁtﬁtf‘N‘Dﬂi’iisifagﬁ'aﬁfﬂ‘rfé-wt-“ﬁ‘-:-'!W',ﬂ':ﬁfm:afJuirzv-int'g-v "3.0:” _‘ l 1: v .w

From:Beaver Medical Internal Med. 909 797 8922 06/28/2016 11:24 #946 RODS/006

As you can see, it adds some extra special characters at the start.

I just want to know whether there is any control parameter for removing such special characters from the output, because this is happening with many input files.

Note: This is not the original image; it is only part of a screenshot of the PDF that I am converting to text, and the output shown is only part of the original output.

My question is not the same as "Limit characters tesseract is looking for", because that question is about ignoring everything other than letters, whereas in my case the output also contains unwanted letters and numbers that I need to remove. After setting `tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz` I still get wrong text, `he fhawfyhftiwlwwfuisipgkggfawfarwtwofrrletitwtfilfmjafgurrwsnnve mania a i v a an`, at the start of the output, and the numbers are removed too. So I want to ask whether there is any way of removing the unwanted letters, special characters, and numbers that appear at the start.

Possible duplicate of [Limit characters tesseract is looking for](http://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for) – sashoalm, Apr 25 '17 at 10:44
@sashoalm After doing this I get `he fhawfyhftiwlwwfuisipgkggfawfarwtwofrrletitwtfilfmjafgurrwsnnve mania a i v a an izromrseaver medical internal tilted sos min seaa oslaelaoie urea arses ptoosloos`; the numbers have disappeared. I want the numbers as well. What can be done? I also need to remove the unwanted text like `he fhawfyhftiwlwwfuisipgkggfawfarwtwofrrletitwtfilfmjafgurrwsnnve mania a i v a an` from the start. – Vibhor Bhatnagar, Apr 26 '17 at 08:04
I also tried adding `tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz0123456789@.` to the config, but the output is `h5g11hwvwhfvt7713fybcfuisiwiggfiwfarwtrtnifrrleiixwtfhfmjafguiuwginnve mam.u a 5 1r v .w lzromrseaver medical internal tilled. 909 797 8922 0612812016 11124 91946 p.0051006`, so the problem still exists. Can you help with this? – Vibhor Bhatnagar, Apr 26 '17 at 08:22
@VibhorBhatnagar Is this text from the question above, `From:Beaver Medical Internal Med. 909 797 8922 06/28/2016 11:24 #946 RODS/006`, the `tesseract` output apart from the leading line of garbage characters? If so, it looks like the OCR accuracy still needs to improve, as some characters are still recognized incorrectly. – thewaywewere, Apr 29 '17 at 16:44
@thewaywewere Yes, this text is the `tesseract` output, along with the leading line of garbage characters. What can be done in this case? – Vibhor Bhatnagar, May 01 '17 at 08:32

score 0 · Answer 1 · answered Apr 25 '17 at 19:31

Create a config file (e.g., "letters") in the tessdata/configs directory, usually /usr/share/tesseract/tessdata/configs or /usr/share/tesseract-ocr/tessdata/configs.

And add this line to the config file:

tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz

A range such as [a-z] might also work, but I have not checked. Then call tesseract like this:

tesseract input.tif output nobatch letters

That will limit tesseract to recognizing only the wanted characters.
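A minimal sketch of the steps above, written in Python purely for illustration. It writes a one-line config file whose whitelist also covers uppercase letters, digits, and a few punctuation marks, then builds the matching command line. The config name `alnum`, the helper names `write_config` and `build_command`, and the choice of extra punctuation are assumptions made for this example, not part of the original answer.

```python
import string
from pathlib import Path

# Hypothetical whitelist: letters, digits, and the punctuation seen in the
# sample header line ("From:", "06/28/2016", "#946", "Med.").
WHITELIST = string.ascii_letters + string.digits + ".:/#"


def write_config(configs_dir, name="alnum", whitelist=WHITELIST):
    """Write a one-line Tesseract config file and return its path."""
    path = Path(configs_dir) / name
    # Config files use "<parameter> <value>" on a single line.
    path.write_text(f"tessedit_char_whitelist {whitelist}\n", encoding="ascii")
    return path


def build_command(image, output_base, config_name="alnum"):
    """Return the argv list; run it with subprocess.run(cmd, check=True)."""
    return ["tesseract", str(image), str(output_base), "nobatch", config_name]
```

Note that a whitelist only restricts which characters may appear in the output. Noise in the image is still recognized as something, so it tends to come out as runs of whitelisted characters rather than disappearing.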

answered Apr 25 '17 at 19:31

Liam

Will `tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz` include only the letters and ignore the numbers? And in `tesseract input.tif output nobatch letters`, what does `nobatch` refer to? – Vibhor Bhatnagar Apr 26 '17 at 07:47
After doing this I get `he fhawfyhftiwlwwfuisipgkggfawfarwtwofrrletitwtfilfmjafgurrwsnnve mania a i v a an izromrseaver medical internal tilted sos min seaa oslaelaoie urea arses ptoosloos`; the numbers have disappeared. I want the numbers as well. What can be done? I also need to remove the unwanted text like `he fhawfyhftiwlwwfuisipgkggfawfarwtwofrrletitwtfilfmjafgurrwsnnve mania a i v a an` from the start. – Vibhor Bhatnagar Apr 26 '17 at 08:01
Also, it is not that I have to remove all special characters; I just want to remove the unwanted special characters that were appearing earlier in the output text. – Vibhor Bhatnagar Apr 26 '17 at 08:12
I also tried adding `tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz0123456789@.` to the config, but the output is `h5g11hwvwhfvt7713fybcfuisiwiggfiwfarwtrtnifrrleiixwtfhfmjafguiuwginnve mam.u a 5 1r v .w lzromrseaver medical internal tilled. 909 797 8922 0612812016 11124 91946 p.0051006`, so the problem still exists. Can you help with this? – Vibhor Bhatnagar Apr 26 '17 at 08:22
You could write a Python script to remove the unwanted characters. If you don't know how to do that, I could write it for you, @VibhorBhatnagar. – Liam Apr 26 '17 at 08:40
I am not doing this in Python; I am converting around 100,000 files to text in a scheduled job. So you mean that after converting to text, I have to process the text file to remove the unwanted characters? Yes, please provide the logic for removing these unwanted characters. – Vibhor Bhatnagar Apr 26 '17 at 09:07
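One possible shape for the post-processing logic requested above, sketched in Python only because that is the language suggested in the comments; the same checks port to whatever language the scheduled job uses. It is a line-based heuristic, not a Tesseract feature: a line is treated as noise if too few of its characters are plain ASCII letters, digits, or whitespace, or if it contains an implausibly long token. The function names and both thresholds (`min_clean_ratio=0.8`, `max_token_len=25`) are assumptions that would need tuning against real output.

```python
import string

# Characters counted as "clean": ASCII letters, digits, space and tab.
_CLEAN_CHARS = set(string.ascii_letters + string.digits + " \t")


def looks_like_noise(line, min_clean_ratio=0.8, max_token_len=25):
    """Heuristically decide whether one line of OCR output is garbage."""
    stripped = line.strip()
    if not stripped:
        return False  # keep blank lines; they only carry layout
    clean = sum(1 for ch in stripped if ch in _CLEAN_CHARS)
    if clean / len(stripped) < min_clean_ratio:
        return True  # dominated by punctuation or non-ASCII symbols
    # Whitelisted output hides noise as very long runs of letters.
    return any(len(token) > max_token_len for token in stripped.split())


def drop_noise_lines(text, **thresholds):
    """Return text with every line flagged by looks_like_noise removed."""
    kept = [ln for ln in text.splitlines() if not looks_like_noise(ln, **thresholds)]
    return "\n".join(kept)
```

With these thresholds the header line `From:Beaver Medical Internal Med. 909 797 8922 06/28/2016 11:24 #946 RODS/006` is kept (70 of its 77 characters are clean), while a whitelisted run such as `he fhawfyhftiwlwwfuisipgkggfawfarwtwofrrletitwtfilfmjafgurrwsnnve mania a i v a an` is dropped because of its 62-character token. The filter works per line, so it cannot separate garbage from real text that Tesseract emitted on the same line.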
