OCR Character Segmentation

Asked Aug 29 '15 at 10:06

Active Mar 16 '22 at 07:42

I am doing penetration testing on my friend's website,
and I've spotted a captcha on the site that looked easy to solve.

After applying a Gaussian blur and then simple thresholds, I ended up with the following:

After feeding this to tesseract-ocr, I got the following output:
CLBTJE

So the OCR failed to recognize the last two characters in the text.
I suspect the main issue is that tesseract can't segment the 'T' and the 'X'.

My main question is: can I force tesseract to do the segmentation, or do I have to implement it myself?

Here is the C# code I'm using to perform OCR:

var image = new Bitmap(pictureBox1.Image);
var ocr = new Tesseract();
// Restrict recognition to upper- and lower-case Latin letters.
const string letters = "QWERTYUIOPASDFGHJKLZXCVBNM";
ocr.SetVariable("tessedit_char_whitelist", letters + letters.ToLower());
ocr.Init(@"tessdata", "eng", false);
var result = ocr.DoOCR(image, Rectangle.Empty);
foreach (Word word in result)
    MessageBox.Show("Confidence : " + word.Confidence + ", Word : " + word.Text);
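
For reference, if the segmentation does have to be done by hand, one common approach is a vertical projection: scan the thresholded image column by column and cut wherever a column contains no ink. The following is only a minimal sketch of that idea and is independent of tesseract. It assumes the image has already been converted to a bool[,] mask where true means ink, and the names ColumnSegmenter and SplitColumns are hypothetical, chosen for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class ColumnSegmenter
{
    // Returns (start, endExclusive) column ranges that contain ink.
    // Assumption: mask[y, x] == true means an ink pixel at row y, column x.
    public static List<Tuple<int, int>> SplitColumns(bool[,] mask)
    {
        int height = mask.GetLength(0);
        int width = mask.GetLength(1);
        var segments = new List<Tuple<int, int>>();
        int start = -1; // -1 means we are currently inside a gap

        for (int x = 0; x < width; x++)
        {
            // A column "has ink" if any pixel in it is set.
            bool hasInk = false;
            for (int y = 0; y < height && !hasInk; y++)
                hasInk = mask[y, x];

            if (hasInk && start < 0)
            {
                start = x; // first inked column of a new segment
            }
            else if (!hasInk && start >= 0)
            {
                segments.Add(Tuple.Create(start, x)); // segment ended at a blank column
                start = -1;
            }
        }

        // Close a segment that runs up to the right edge of the image.
        if (start >= 0)
            segments.Add(Tuple.Create(start, width));

        return segments;
    }
}
```

Each returned range could then be cropped out, for example with Bitmap.Clone(Rectangle, PixelFormat), and passed to the OCR engine one character at a time. Note that glyphs which touch or overlap, as the 'T' and 'X' appear to here, share inked columns and would come back as a single wide segment, so they would still need an extra split, for instance at the column with the least ink inside an unusually wide range.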

edited Aug 29 '15 at 10:10

Lee Taylor

asked Aug 29 '15 at 10:06

user3788486

Thanks @Lee, excuse me for the stupid mistakes! – user3788486 Aug 29 '15 at 10:14
What I'm thinking of as well, maybe it's possible to tell tesseract how many characters I have? If I could 'inform' tesseract that I only had 6 characters to solve, maybe it would be better at doing it? – user3788486 Aug 29 '15 at 10:21
I have a similar problem. Have you found a solution? – Yang Kui Sep 28 '15 at 14:42
