I am running Tesseract OCR on PDF, image, and TIFF files saved in my database, but many of the files produce a lot of garbage text in the output. For example, in this case the image gave me the following text output:
“‘55“ .'Hï¬ï¬jï¬tï¬tf‘N‘Dfli’iisifagï¬'aï¬ffl‘rfé-wt-“ï¬â€˜-:-'!W',fl':ï¬fm:afJuirzv-int'g-v "3.0:†_‘ l 1: v .w
From:Beaver Medical Internal Med. 909 797 8922 06/28/2016 11:24 #946 RODS/006
As you can see, it adds extra special characters at the start of the output.
I would like to know whether there is a control parameter for removing such special characters from the output, because this happens with many input files.
Note: This is not the original image. It is only part of a screenshot of the PDF that I am converting to text, and the output shown is only part of the original output.
My question is not a duplicate of "Limit characters tesseract is looking for", because that question is about ignoring everything other than letters, whereas in my case the unwanted output also contains letters and numbers that I need to remove. After setting tessedit_char_whitelist to abcdefghijklmnopqrstuvwxyz, I still get wrong text such as
he fhawfyhftiwlwwfuisipgkggfawfarwtwofrrletitwtfilfmjafgurrwsnnve mania a i v a an
at the start of the output text, and the whitelist also removes the digits, which I want to keep. So is there any way to remove the unwanted letters, special characters, and numbers that appear at the start of the output?
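To make the goal concrete, here is a minimal sketch of the kind of filtering I have in mind, written as plain Python post-processing on the OCR text. The function names and the 0.8 threshold are hypothetical choices of mine, not Tesseract parameters. It only catches lines dominated by special characters, so it would not remove the letters-only garbage that the whitelist produces.

```python
def clean_ratio(line):
    """Fraction of characters that are ASCII letters, digits, or spaces."""
    if not line:
        return 1.0  # treat empty lines as clean so they are kept
    ok = sum(1 for ch in line if ch == " " or (ch.isascii() and ch.isalnum()))
    return ok / len(line)


def drop_garbage_lines(text, min_ratio=0.8):
    """Keep only lines whose clean_ratio is at least min_ratio (assumed threshold)."""
    kept = [ln for ln in text.splitlines() if clean_ratio(ln) >= min_ratio]
    return "\n".join(kept)
```

With this heuristic the "From:Beaver Medical ..." line above would be kept, since only a handful of its characters are punctuation, while a line made up mostly of quotes, colons, and dashes would be dropped. I would prefer a built-in Tesseract option over this kind of guesswork if one exists.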