Tweak tesseract for better detection of URLs in image

Question

I have an image that I can't get tesseract to recognise as text. All my input text will be URLs.

As you can see, the image is as clear as it can be.

Running tesseract test2.png stdout returns http:II11111111111111111111111111111111111 1111111111111111111.coml, which is close but not correct.

When setting the tessedit_char_whitelist parameter to htp:/1.com it recognises the string correctly (but I want more general recognition of URLs as well).

Passing in a pattern file like the one below, using the command line tesseract test2.png stdout --user-patterns ./patterns.txt,

\n\*://\n\*
http://\n\*
\n\*.com

doesn't help with recognition. It still prefers I over /. (More details about the pattern file)

I have also tried setting the parameter ok_repeated_ch_non_alphanum_wds to include / (and chs_trailing_punct{1,2} for the trailing /), but it doesn't seem to work. Specifying --user-words, with http:// as the word, doesn't help either.

Is there a way of specifying char priority for tesseract?

Version info:

$ tesseract -v
tesseract 3.04.01
 leptonica-1.73
  libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

score 3 · Accepted Answer · answered Jun 02 '16 at 21:14

You can achieve this by adding the following line to your unicharambigs file:

3 : I I 3 : / / 1

1. Extract the unicharambigs file with combine_tessdata -e eng.traineddata eng.unicharambigs
2. Edit the unicharambigs file, e.g. with nano eng.unicharambigs, and add the line above (make sure to use tabs after both 3s and the second /).
3. Overwrite the unicharambigs file in the traineddata file with the edited version: combine_tessdata -o eng.traineddata eng.unicharambigs
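
Since the tab separators are easy to lose in an editor, the entry can also be generated from the shell. This is only a minimal sketch. It assumes the five tab-separated fields described in the unicharambigs documentation (count, characters, count, characters, type). ambig_line is a hypothetical helper name, not part of Tesseract.

```shell
#!/bin/sh
# Hypothetical helper (not a Tesseract tool): print one unicharambigs entry.
# Assumed field layout, separated by tabs:
#   $1 = number of characters to match
#   $2 = characters to match, separated by single spaces
#   $3 = number of replacement characters
#   $4 = replacement characters, separated by single spaces
#   $5 = type flag (the answer uses 1)
ambig_line() {
    printf '%s\t%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Print the entry from the answer so the tabs can be inspected,
# for example by piping the output through `cat -A` or `od -c`.
ambig_line 3 ': I I' 3 ': / /' 1
```

To use it, redirect the output onto the end of the extracted file (ambig_line 3 ': I I' 3 ': / /' 1 >> eng.unicharambigs) and then repack with the combine_tessdata -o command from the steps above.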

Output using the amended traineddata file:

$ tesseract test2.png stdout
http://11111111111111111111111111111111111
1111111111111111111.coml

I added the line `4 c o m l 4 c o m / 1` for last `/` `l` confusion, but your idea worked. — Diederik, Jun 04 '16 at 20:51
