C# OCR can't recognize digits (Tesseract 2)

Question

I'm trying to extract digits from the following:

It fails; I get a ~ in return. I'm using Google's Tesseract 2 from C# (through an open source C# wrapper), and now I'm wondering: is this image too crappy to be used for OCR?

Because IMHO the digits are perfectly clear.

Do you have any other OCR engine in mind that would nail this down?

EDIT

I've also tried with Asprise OCR (http://asprise.com/product/ocr/selector.php) but it fails to parse the image too...

Probably any engine you pay $ for would be able to get the digits - ABBYY or Océ, for example. — Otávio Décio, Mar 29 '11 at 15:47
This is for my company. And judging by the task's size, I'm sure they won't pay bucks for this, and I can't pay for it either ^_^. This is the dilemma :/. But do you think my image is too crappy for, let's say, *weak* OCR engines? — CoolStraw, Mar 29 '11 at 15:50
Not really bad, but I personally would never use tesseract for anything serious. It is an old, outdated and buggy engine. — Otávio Décio, Mar 29 '11 at 15:52
Would you recommend any other open source or even free engine? — CoolStraw, Mar 29 '11 at 15:54
I have yet to find a good open source OCR. I would be very interested in finding out if such a thing exists as well. From what I know, there is a lot of money in document processing, where things are charged by the click (document or page). — Otávio Décio, Mar 29 '11 at 16:06
You might have to train tesseract on the font in order to get it to recognize the numbers. See: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract2 — user7116, Mar 29 '11 at 16:24
Yes, that's a crappy image. The point size is way too small, the text isn't anti-aliased and thus too blocky. The latter could be a scanner artifact. — Hans Passant, Mar 29 '11 at 17:05
@sixletter thanks I'll have a look at this training crap. @Hans: aren't you aware of some techniques to make this image look better? like with bigger font and bolder color? — CoolStraw, Mar 30 '11 at 08:02
@six the training is only for adding new languages support. It actually has nothing to do with raising the OCR quality — CoolStraw, Mar 30 '11 at 08:16
@CoolStraw: I'm aware, however, often times tesseract will have issues with the font. Have you tried scaling the image with ImageMagick first? — user7116, Mar 30 '11 at 14:31
@Otávio Décio, @CoolStraw - unfortunately, you can't ever trust any OCR 100%, ESPECIALLY with numbers, although you'll probably find better if you pay. So make sure the results of errors can't be catastrophic. — FastAl, Mar 30 '11 at 14:43
@FastAl if I can extract 80% of the result it's still very good. And I can't pay simply because the client won't pay for this (it relates to a small task compared to the whole project). — CoolStraw, Mar 30 '11 at 14:53

score 7 · Accepted Answer · answered Mar 30 '11 at 14:38

I suggest resizing. I zoomed this page to 200% in IE, took a screenshot, printed it to PDF, and imported it into my program that uses tessnet. Tess nailed it! Unless I read the #s wrong :-)

Although confidence = 140 (under 100 is preferred, if you wondered). Of course, when I tried the original size, I didn't get ~; I got about 1/2 the #s right, a bunch of letters, and other garbage. Not good enough, but better.

Tesseract 2 seems to like images of a certain size.

My program does some processing to get that to work. I suggest using .NET GDI+ to convert the image to 32-bit and to resize it with the interpolation mode set to HighQualityBicubic. This seems to 'fill in the gaps' a bit.
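A minimal sketch of that preprocessing step, assuming System.Drawing (GDI+) is available. The names OcrPreprocess, Enlarge, and factor are made up for illustration; this is not the answerer's actual code, only one plausible way to do a 32-bit conversion plus a HighQualityBicubic enlargement.

```csharp
using System;
using System.Drawing;
using System.Drawing.Drawing2D;
using System.Drawing.Imaging;

public static class OcrPreprocess
{
    // Hypothetical helper: returns a 32-bit copy of the source image,
    // enlarged by the given factor. Shrinking is rejected on purpose.
    public static Bitmap Enlarge(Image source, double factor)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (factor < 1.0) throw new ArgumentOutOfRangeException("factor", "Only enlarge, do not shrink.");

        int width = (int)Math.Round(source.Width * factor);
        int height = (int)Math.Round(source.Height * factor);

        // Format32bppArgb gives the 32-bit target mentioned above.
        var result = new Bitmap(width, height, PixelFormat.Format32bppArgb);
        using (Graphics g = Graphics.FromImage(result))
        {
            // High quality bicubic interpolation smooths the blocky edges.
            g.InterpolationMode = InterpolationMode.HighQualityBicubic;
            g.PixelOffsetMode = PixelOffsetMode.HighQuality;
            g.DrawImage(source, new Rectangle(0, 0, width, height));
        }
        return result;
    }
}
```

The returned bitmap can then be handed to the OCR wrapper in place of the original image. A factor of 2.0 roughly mirrors the 200% zoom described above.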

Experiment to find sizes that work: I have found that when the image is too big or too small, Tesseract performs differently.
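One way to experiment systematically is to try several enlargement factors and keep the one with the best confidence. The sketch below assumes a hypothetical callback, scoreAtScale, that enlarges the image by the given factor, runs OCR, and returns a confidence score where lower is better (as with the value of 140 mentioned above). ScaleSearch and PickBestScale are illustrative names only.

```csharp
using System;
using System.Collections.Generic;

public static class ScaleSearch
{
    // Returns the factor with the lowest score, or NaN if no factors are given.
    // scoreAtScale is a hypothetical callback supplied by the caller:
    // it should resize, run OCR, and report a "lower is better" confidence.
    public static double PickBestScale(IEnumerable<double> factors, Func<double, double> scoreAtScale)
    {
        double bestFactor = double.NaN;
        double bestScore = double.PositiveInfinity;

        foreach (double factor in factors)
        {
            double score = scoreAtScale(factor);
            if (score < bestScore)
            {
                bestScore = score;
                bestFactor = factor;
            }
        }
        return bestFactor;
    }
}
```

A handful of factors such as 1.5, 2.0, 3.0, and 4.0 is probably enough to see where recognition starts to degrade.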

Both issues are preprocessing, which is easy, and you'd think Tesseract would try it itself. However, I know how to resize and interpolate; I don't know how to OCR! So I am willing to settle.

FastAl

Can I have the code that you use to rework the image quality, so I can plug it in and test it? Thanks – CoolStraw Mar 30 '11 at 14:50
@CoolStraw - Well, actually, I took a screenshot of IE8 with Alfred Bolliger's PrintKey 2000, printed it with the PDFMachineWhite free version, then my program automatically converted it to WMF using verydoc's pdf2vec, and, using VB.NET/GDI+, rendered the WMF in, as well as sized it, presented it in a UI, allowed me to drag a selection rectangle and pick OCR from a popup, saved a snippet for a separate process to OCR it using tessnet ...(I couldn't resist telling!) Don't work that hard. Use code like this (http://www.bobpowell.net/highqualitythumb.htm) to resize; only enlarge, don't shrink. – FastAl Mar 30 '11 at 17:46
Man you rock my world baby boy ! Thank you very much you solved my blocker issue ! – CoolStraw Mar 31 '11 at 09:40

score 1 · Answer 2 · answered May 14 '11 at 23:15

Your image's resolution is too low (96 DPI); perhaps it is a screenshot. Rescale it to 300 DPI, and tessnet2 should be able to recognize it.
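To make the arithmetic concrete: going from 96 DPI to 300 DPI means enlarging the pixel dimensions by 300 / 96 = 3.125. The sketch below only computes the target size; DpiScaling, ScaleFactor, and ScalePixels are hypothetical names chosen for illustration, and the actual resampling would still be done with an image library such as GDI+.

```csharp
using System;

public static class DpiScaling
{
    // Ratio by which pixel dimensions must grow to go from one DPI to another.
    public static double ScaleFactor(double fromDpi, double toDpi)
    {
        if (fromDpi <= 0 || toDpi <= 0)
            throw new ArgumentException("DPI values must be positive.");
        return toDpi / fromDpi;
    }

    // Scales a single pixel dimension (width or height), rounded to the nearest pixel.
    public static int ScalePixels(int pixels, double fromDpi, double toDpi)
    {
        return (int)Math.Round(pixels * ScaleFactor(fromDpi, toDpi));
    }
}
```

For example, a 200 x 50 pixel screenshot at 96 DPI would become 625 x 156 pixels at 300 DPI. GDI+ also has Bitmap.SetResolution for recording the DPI value on the bitmap, but that call only changes metadata; the pixels still have to be resampled.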

nguyenq
