
I'm using OCR (Tesseract) to extract data from a document. The document must contain certain keywords to be valid. OCR isn't perfect, so it may sometimes read, for example, "Technlquos" instead of "Techniques".
So I'm wondering whether there is a way in Java to find "techniques" in a text even if the OCR read it as "Technlquos". The same goes for multi-word terms: searching for "Sciences Techniques" should accept "Sclences Technlquos". I'm after something like finding the closest word to the searched word and accepting it if it is close enough (a 75% match, for example). I found some solutions here, but none of them answers my question.
Thank you

hereForLearing
  • *I found some solutions [here](http://stackoverflow.com/questions/327513/fuzzy-string-search-in-java) but none of them is answering my question.* Explain why your problem is different if you want a different solution. – shmosel May 20 '16 at 19:43
  • If I have understood the answers correctly, they are for comparing two words, not for searching for one or more words in a text – hereForLearing May 20 '16 at 19:47
  • Sounds like you're concerned about [this](http://stackoverflow.com/questions/327513/fuzzy-string-search-in-java#comment54049910_327595). But there are other solutions there. – shmosel May 20 '16 at 19:54
  • Thank you, that's what I need: the Bitap algorithm. Please add it as an answer so I can accept it – hereForLearing May 20 '16 at 21:15
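
For anyone landing here with the same problem, below is a minimal sketch of the kind of search discussed in the comments. It is not Bitap itself: it uses the plain dynamic-programming form of approximate substring matching (often attributed to Sellers), which answers the same question, namely the smallest edit distance between the keyword and any substring of the OCR text. The class name `FuzzyContains`, the method names, and the way the 75% threshold is turned into a score (`1 - distance / keyword length`) are assumptions made for illustration, not part of Tesseract or any library.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tesseract or any OCR library.
class FuzzyContains {

    // Smallest Levenshtein distance between the pattern and ANY substring of the text.
    static int minSubstringDistance(String text, String pattern) {
        String t = text.toLowerCase(Locale.ROOT);
        String p = pattern.toLowerCase(Locale.ROOT);
        int m = p.length();

        // prev[i] = cost of aligning the first i pattern chars with a substring
        // ending at the previous text position.
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int i = 0; i <= m; i++) {
            prev[i] = i; // before reading any text, i pattern chars cost i edits
        }
        int best = prev[m];

        for (int j = 1; j <= t.length(); j++) {
            curr[0] = 0; // a match may start at any position in the text for free
            for (int i = 1; i <= m; i++) {
                int cost = (p.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                int substitute = prev[i - 1] + cost;
                int skipTextChar = prev[i] + 1;
                int skipPatternChar = curr[i - 1] + 1;
                curr[i] = Math.min(substitute, Math.min(skipTextChar, skipPatternChar));
            }
            best = Math.min(best, curr[m]);
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return best;
    }

    // True if some substring of the text matches the pattern with a score of at
    // least minSimilarity (assumed scoring: 1 - distance / pattern length).
    static boolean fuzzyContains(String text, String pattern, double minSimilarity) {
        if (pattern.isEmpty()) {
            return true;
        }
        int distance = minSubstringDistance(text, pattern);
        double similarity = 1.0 - (double) distance / pattern.length();
        return similarity >= minSimilarity;
    }
}
```

With this sketch, `FuzzyContains.fuzzyContains(ocrText, "Sciences Techniques", 0.75)` would accept a text containing "Sclences Technlquos", since that differs from the 19-character keyword by at most three edits. The comparison is case-insensitive and runs in time proportional to text length times keyword length, which should be acceptable for a handful of keywords per document.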

1 Answer

-1

In other OCR libraries, this can be done by keeping the recognition variants of each word in the resulting text. Most likely, your OCR did find "Techniques" as a candidate but flagged the word as suspicious. If there is an option to keep the recognition variants of suspicious words, you will be able to search for the keyword among them.

Nadia Solovyeva