I have an image that I can't get tesseract to recognise as text. All my input text will be URLs.
As you can see, the image is as clear as it can be.
When running tesseract test2.png stdout, it returns:
http:II11111111111111111111111111111111111
1111111111111111111.coml
This is close, but not correct.
When setting the tessedit_char_whitelist parameter to htp:/1.com, it recognises the string correctly (but I want more general recognition of URLs as well).
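For context, a whitelist like this can be supplied through tesseract's -c VAR=VALUE command-line option. The sketch below is only an illustration: the helper name build_command and the constant URL_CHARS are hypothetical, and the wider whitelist is simply the set of characters RFC 3986 permits in a URI. It assembles the command line without running it, and it does not show whether a wider whitelist still avoids the I versus / confusion (note that I is itself a legal URL character).

```python
import string

# Characters permitted in a URI by RFC 3986: unreserved, reserved, and '%'.
# Assumption: this is a reasonable "general URL" whitelist; it is not tuned.
URL_CHARS = string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%"


def build_command(image, whitelist=URL_CHARS):
    """Return the argv for a tesseract run restricted to `whitelist`.

    Hypothetical helper: it only builds the argument list, so it works
    even where tesseract is not installed.
    """
    return [
        "tesseract", image, "stdout",
        "-c", "tessedit_char_whitelist=" + whitelist,
    ]


if __name__ == "__main__":
    # The narrow whitelist from the question, then the general URL one.
    print(build_command("test2.png", "htp:/1.com"))
    print(build_command("test2.png"))
```

Passing the result as a list to something like subprocess.run avoids shell quoting problems with characters such as &, ' and ( in the wider whitelist.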
Passing in a pattern file like the one below, with the command line tesseract test2.png stdout --user-patterns ./patterns.txt
\n\*://\n\*
http://\n\*
\n\*.com
doesn't help with recognition. It still prefers I over /. (More details about the pattern file)
I have also tried setting the parameter ok_repeated_ch_non_alphanum_wds to include / (and chs_trailing_punct{1,2} for the trailing /), but it doesn't seem to work. Specifying --user-words doesn't help either (with the "words" being http://).
Is there a way of specifying char priority for tesseract?
Version info:
$ tesseract -v
tesseract 3.04.01
leptonica-1.73
libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0