Where do I start for Text Pattern Recognition - Java Based

Question

I am seriously considering writing an Optical Character Recognition program. I am well versed in Java and would love to know which libraries are available. Basically, I want to convert something like the following to text. I will need manual intervention to specify a pattern. For example, I would ask the user to mark f in this text, so that I know where f occurs.

enter image description here

I am entirely new to this, so I don't mind learning from scratch as well. I need guidance.

Some suggestions in this [post](http://stackoverflow.com/q/1813881/3009). — highlycaffeinated, Jun 10 '11 at 19:44
Are you looking toward rolling your own OCR, or looking for an already existing OCR software in Java? — Atreys, Jun 10 '11 at 20:32

score 2 · Accepted Answer · answered Jun 10 '11 at 20:49

If you are thinking of coding an OCR program from scratch, reading up on techniques may be useful. I found an OCR Survey from 1996 which reviews some of the popular techniques from a decade and a half ago. Reading that might be helpful; track down papers it cites or papers which cite it.

Usually the process goes as follows:

1. find text
2. find characters in the text
3. extract features from the characters found
4. do pattern matching
5. report suspected character
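
As a rough illustration of the steps listed above, here is a minimal Java sketch. It assumes the first step is already done: the input is a single cropped, binarized text line given as rows of `#` (ink) and `.` (background). The class and method names are hypothetical, and the 2x2 zoning features and nearest-neighbour matching are simple stand-ins chosen for brevity, not the techniques a real OCR engine would use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy OCR pipeline on a binary image given as rows of '#' (ink) and '.' (background). */
final class OcrPipelineSketch {

    /** Find characters: split the line into glyphs wherever a column holds no ink. */
    static List<String[]> segmentGlyphs(String[] line) {
        List<String[]> glyphs = new ArrayList<>();
        int width = line[0].length();
        int start = -1; // first column of the glyph being scanned, or -1 between glyphs
        for (int x = 0; x <= width; x++) {
            boolean ink = false;
            if (x < width) {
                for (String row : line) {
                    if (row.charAt(x) == '#') { ink = true; break; }
                }
            }
            if (ink && start < 0) {
                start = x;
            } else if (!ink && start >= 0) {
                String[] glyph = new String[line.length];
                for (int y = 0; y < line.length; y++) {
                    glyph[y] = line[y].substring(start, x);
                }
                glyphs.add(glyph);
                start = -1;
            }
        }
        return glyphs;
    }

    /** Extract features: fraction of ink pixels in each cell of a 2x2 grid over the glyph. */
    static double[] extractFeatures(String[] glyph) {
        int h = glyph.length, w = glyph[0].length();
        double[] ink = new double[4];
        double[] area = new double[4];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int cell = (2 * y / h) * 2 + (2 * x / w);
                area[cell]++;
                if (glyph[y].charAt(x) == '#') ink[cell]++;
            }
        }
        for (int i = 0; i < 4; i++) {
            ink[i] = area[i] == 0 ? 0 : ink[i] / area[i];
        }
        return ink;
    }

    /** Pattern matching: return the label of the template with the smallest squared distance. */
    static char classify(double[] features, Map<Character, double[]> templates) {
        char best = '?';
        double bestDist = Double.MAX_VALUE;
        for (Map.Entry<Character, double[]> e : templates.entrySet()) {
            double[] t = e.getValue();
            double d = 0;
            for (int i = 0; i < features.length; i++) {
                d += (features[i] - t[i]) * (features[i] - t[i]);
            }
            if (d < bestDist) { bestDist = d; best = e.getKey(); }
        }
        return best;
    }

    /** Report: run every segmented glyph through the classifier and join the results. */
    static String recognize(String[] line, Map<Character, double[]> templates) {
        StringBuilder out = new StringBuilder();
        for (String[] glyph : segmentGlyphs(line)) {
            out.append(classify(extractFeatures(glyph), templates));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Templates stand in for glyphs that a user has labelled by hand.
        Map<Character, double[]> templates = new LinkedHashMap<>();
        templates.put('L', extractFeatures(new String[] {"#..", "#..", "#..", "###"}));
        templates.put('T', extractFeatures(new String[] {"###", ".#.", ".#.", ".#."}));
        String[] line = {
            "#...###.#..",
            "#....#..#..",
            "#....#..#..",
            "###..#..###",
        };
        System.out.println(recognize(line, templates)); // prints LTL
    }
}
```

In the question's scenario, the `templates` map would be filled from glyphs the user marks manually; with an annotated corpus it would be filled from the labelled samples instead.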

While getting a user to annotate text is fun and exciting, finding a collection of handwriting which is already annotated might save you a lot of time. That way you can focus on the nuts and bolts of doing OCR rather than building your own database of annotated text.

To start with a slightly easier task, you might want to consider building a system to recognize handwritten digits. The USPS produced a corpus for developing systems to do this for ZIP code processing. The link was something I found with a quick search.

I also found http://stackoverflow.com/questions/850717/what-are-some-popular-ocr-algorithms when searching SO for [OCR]. There's another survey linked there, as well as plenty of discussion — Atreys, Jun 10 '11 at 21:21

score 2 · Answer 2 · answered Jun 10 '11 at 22:46

If you want to use/look at a library, you could try the Google-endorsed Tesseract.
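
Tesseract itself is a C++ engine with a command-line tool, so one way to try it from Java is to launch that tool as a subprocess. The sketch below assumes Java 11 or later, that the `tesseract` executable is installed and on the `PATH`, and that the English language data (`eng`) is available; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

/** Calls the Tesseract command-line tool from Java (assumes it is installed and on the PATH). */
final class TesseractCliSketch {

    /** Builds the command line: tesseract <image> stdout -l <language>. */
    static List<String> buildCommand(Path image, String language) {
        return List.of("tesseract", image.toString(), "stdout", "-l", language);
    }

    /** Runs Tesseract and returns the recognized text; fails if the exit code is non-zero. */
    static String recognize(Path image, String language) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, language))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // drop diagnostics so stderr cannot block
                .start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("tesseract exited with code " + exit);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java TesseractCliSketch scan.png
        System.out.println(recognize(Path.of(args[0]), "eng"));
    }
}
```

Java bindings for Tesseract also exist, but a subprocess call keeps this example free of extra dependencies.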

swilliams

