
Using Kofax Capture 10 (SP1, FP2), I have recognition zones set up on some fields on a document. These fields are consistently recognizing I's as 1's. I have tried every combination of settings I can think of that doesn't obliterate all the characters in the field, to no avail. I have tried Advanced OCR and High Performance OCR, different character filters. All kinds of things.

What options can I try to automatically recognize this character? Should I tell the people producing the forms (they're generated by a computer) they need to try using a different font? Convince them that now is the time to consider using Validation?

My current field setup:

Kofax Advanced OCR with no custom settings except Maximize Accuracy in the advanced dialog. This has worked as well as anything else I have tried so far.

The font being used is 8-12 pt Arial, btw.

Matt
  • Other letters do not seem to have this same problem – Matt Dec 12 '12 at 17:22
  • What is the scanning resolution? – Lunatik Dec 19 '12 at 09:45
  • I believe it's 200 dpi. The PDF files being imported are only 120 dpi so I did not waste effort trying to get more out of them than that. – Matt Dec 19 '12 at 18:20
  • That's a very low resolution for successful OCR, I'm not surprised you're having issues! I presume you're using VRS during import to try and clean the image up as much as possible? – Lunatik Jan 09 '13 at 10:16
  • Just regular recognition profiles. They seem to do a reasonable job. I guess I'm just not sure what VRS could do on top of that. That said, we're up to about 90-96% accuracy on a certain set of test files, and it's just the one field that's really not holding up its end of the bargain at one point. Kofax support doesn't even think there's much more I could do to increase it. I guess I could ask them to crank the DPI a little more, maybe to 300. – Matt Jan 09 '13 at 15:22
  • At that level of extraction I'd be happy. I've found kerning to be at least as critical as actual font size when looking at that type of consistent misread, something that higher DPI may not help with. Most of our extraction is done in KTM, so we have a lot more options in terms of scripting to try and catch these kinds of 'known' issues. Good luck! – Lunatik Jan 09 '13 at 15:48
  • Yeah, in that case the font is probably a huge part of the issue. I've asked them if they can change the font to something more computer-legible like Courier or Times New Roman, but apparently that's an act of Congress. I think in the end we'll be going with XML import for these to get 100% automatic accuracy. – Matt Jan 09 '13 at 16:32
  • Recognition profiles and image cleanup in VRS are two totally different animals. Image cleanup affects the image BEFORE recognition (it's a temporary or permanent processing step), while recognition profiles process data AFTER image cleanup has taken place. – Daniel Jan 20 '13 at 20:13
  • Yeah, but really what can it accomplish above and beyond what the recognition profiles are already doing to these computer-generated PDF files that are being imported? It's not like there are coffee stains or crumpled pages... – Matt Jan 21 '13 at 15:08
  • You did not mention that you process computer-generated documents. For these you do not need to use image cleanup. In fact you should avoid it, as it does more harm than good. Look at the image cleanup profile: Deskew? No need for e-documents. Despeckle? No need for e-documents. Character smoothing? Won't make characters any better... and we can go on. – Daniel Jan 22 '13 at 10:55
  • You know, I guess I've never thought of that before. Normally I just start with the standard Kofax Advanced OCR and hope for the best. If that doesn't seem to be doing the trick I tweak from there because when you test the Advanced OCR it does not appear to be doing anything to the text even though despeckle and smoothing are defaulted ON. – Matt Jan 22 '13 at 15:27
  • Since e-documents have no speckles, despeckle will not do anything. Unless you give insanely high values and you'll see punctuation and small characters disappear. Effects of line removal are easy to spot. The image enhancement features - except for thicken and thin - are really hard to grasp and see their results, as they are only minor adjustments and they only kick in in case of special conditions (like if there's a 1 pixel break on an edge) that do not really happen with e-docs. So my advice remains: turn off *all* image processing for e-docs. – Daniel Jan 22 '13 at 20:13
  • It's very common for OCR engines to mess up I and 1. If your field is numeric only, you should be able to force that, which would help tremendously, I would think. – johnjps111 Jun 12 '15 at 18:45

1 Answer


Validation is a MUST if OCR is involved, no matter whether e-docs or paper docs are processed. For paper docs it is an even bigger must.

Use at least 11 pt Arial and render the document as a 300 dpi image. This will give you, I'd say, 99.9% accuracy (that is, 1 character in every 1,000 missed). Accuracy can drop if you have data where digits and letters are mixed within one word, especially 1-I, 0-O, 6-G.

Recognition scripts can be used if you know that you have no such mixed data and OCR still returns mixed digits and letters. You can use the PostRecognition script event to catch the recognition result from the OCR engine and modify it with SBL or VB.NET scripts. But it greatly depends on the documents and data you process.
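For illustration only - this is not the literal Kofax scripting API, and the module and function names here are made up - the kind of context-based fix-up you would call from a PostRecognition handler could look like this in VB.NET. It assumes the recognized field value reaches your script as a plain string:

```vb
' Minimal sketch, not the actual Kofax API: RecognitionFixup and
' FixConfusablePairs are hypothetical names, and the exact PostRecognition
' event signature depends on your Kofax version and scripting language.
Imports System

Module RecognitionFixup

    ' Swap '1' to 'I' only when its neighbors are letters and none are
    ' digits, and 'I' to '1' only in the opposite case. Mixed contexts
    ' such as "A12" are deliberately left untouched.
    Public Function FixConfusablePairs(ByVal value As String) As String
        Dim chars() As Char = value.ToCharArray()

        For i As Integer = 0 To chars.Length - 1
            Dim letterNeighbor As Boolean =
                (i > 0 AndAlso Char.IsLetter(chars(i - 1))) OrElse
                (i < chars.Length - 1 AndAlso Char.IsLetter(chars(i + 1)))
            Dim digitNeighbor As Boolean =
                (i > 0 AndAlso Char.IsDigit(chars(i - 1))) OrElse
                (i < chars.Length - 1 AndAlso Char.IsDigit(chars(i + 1)))

            If chars(i) = "1"c AndAlso letterNeighbor AndAlso Not digitNeighbor Then
                chars(i) = "I"c
            ElseIf chars(i) = "I"c AndAlso digitNeighbor AndAlso Not letterNeighbor Then
                chars(i) = "1"c
            End If
        Next

        Return New String(chars)
    End Function

    Sub Main()
        Console.WriteLine(FixConfusablePairs("1NVO1CE")) ' prints INVOICE
        Console.WriteLine(FixConfusablePairs("I23"))     ' prints 123
    End Sub
End Module
```

Test any such rule against the full sample set first; a heuristic like this can silently corrupt fields that legitimately mix letters and digits.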

Image cleanup will not do any good for e-docs.

I'd say your best bet would be to use validation. At least that will push responsibility to the validation operator.

Daniel
  • I agree that validation should probably be happening, but customer wants "automatic" and apparently can't spare the resources to validate hundreds of documents every day. I'll go ahead and mark this as the answer, although I doubt I'll be able to get them to do this as we've already started working towards a solution involving XML import with KIC-ED. – Matt Jan 23 '13 at 19:56
  • As I wrote on another forum, your customer has unrealistic expectations and no knowledge of technologies. Try to enlighten them that OCR will NEVER - I repeat: NEVER - be 100% accurate given enough samples, no matter what you do. This is not a Kofax issue, this is a technology problem: no matter which product they choose, 100% can never be achieved. And if it's not 100%, then you need someone to look at the data. You can speed things up by automatically validating data where possible. The other solution is XML, as you wrote, which will give you better results. – Daniel Jan 24 '13 at 10:09
  • I want to say the suggestion about removing image cleanup from the e-document recognition has worked better for me than any other advice I've ever gotten about this. I used this technique on another batch class for the same customer and so far it's GREAT. I'm pretty sure they don't cover that information at the Kofax training, or if they do I forgot it in the interim. – Matt May 03 '13 at 15:20
  • Image cleanup is more of an art than a science. The fundamental problem is that it's a Catch-22: in order to PROPERLY perform image cleanup you should identify the document. But in order to identify the document you must already have performed cleanup. Since there's no 'one-size-fits-all' solution, you need to test with a wide range of samples, adjust settings, and ALWAYS re-test with the ENTIRE sample set to see if something got worse. – Daniel May 06 '13 at 09:19
  • Ideally, where there are millions of documents to be digitized, is manual validation of data a must? – Jinxed Mar 30 '19 at 12:30