5

Hi, I am looking for an open-source Java API that can convert a TIFF image to a searchable PDF (OCR). I have researched this but found nothing so far.

NOTE: I have looked at the post "Java OCR implementation", but that API does not convert the image to PDF. However, I am still experimenting with the code.

Thang Pham

2 Answers

6

You can convert images to PDF using iText. The hard thing here is doing the OCR, not creating the PDF.

I will warn you: any OCR engine that is worth using is going to cost you a significant amount of money. Free and/or open-source ones are generally pet projects or proofs of concept for some algorithm or another, and are not suitable for real-world OCR applications. Tesseract is probably the best of the bunch, but even its accuracy is far worse than that of commercial engines.
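If Tesseract's accuracy turns out to be acceptable for your documents, the following is a minimal sketch of driving it from Java without JNI. It assumes a `tesseract` executable (version 3.03 or later, which added the `pdf` output config) is installed and on the PATH. The class name `TesseractPdf` and its methods are hypothetical names for illustration, not part of any library.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: shells out to the Tesseract command-line tool.
class TesseractPdf {

    // Builds the command line: tesseract <input> <outputBase> pdf
    // Tesseract appends ".pdf" to the output base name itself.
    static List<String> buildCommand(Path tiff, Path outputBase) {
        return Arrays.asList("tesseract", tiff.toString(), outputBase.toString(), "pdf");
    }

    // Runs Tesseract and returns its exit code.
    // Assumes the executable is on the PATH; otherwise start() throws IOException.
    static int run(Path tiff, Path outputBase) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(tiff, outputBase))
                .inheritIO() // show Tesseract's own messages on our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: TesseractPdf <input.tif> <outputBase>");
            return;
        }
        int exit = run(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("tesseract exited with code " + exit);
    }
}
```

Calling `TesseractPdf.run(Paths.get("scan.tif"), Paths.get("scan"))` would be expected to write `scan.pdf` with the page image and an invisible text layer, subject to the accuracy caveats above.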

We have a commercial OCR application, and I've been down this path while evaluating engines - I'd suggest that you bite the bullet and reach out to the engine providers and get quotes: Abbyy (best accuracy, most expensive, slower), Expervision (fast, not as accurate, middle of the road price), Nuance (middle of the road speed, accuracy and price). None of these will be written in Java, so you should plan some time to develop JNI code around their APIs.

Good luck - it's a big project!

Kevin Day
  • What if all I want is to take a scanned PDF and convert it to a PDF with searchable text? Are Abbyy, Expervision and the rest still the right route to go? – Don Cheadle Oct 21 '14 at 20:38
  • Yes - plus a ton of work to make sure the original content is preserved. We have a commercial application that does this - we've got 10 years of development into it, and I can assure you that the effort is substantial. – Kevin Day Oct 24 '14 at 20:00
  • :D my boss thinks this is something to do over the weekend – Don Cheadle Oct 24 '14 at 20:00
2

Cuneiform is free and easy to use. It can output hOCR, which can then be used to generate an invisible text layer on a PDF using the hocr2pdf tool, which is part of ExactImage.
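The two steps can be chained from Java by calling the command-line tools. This is a rough sketch, assuming `cuneiform` and `hocr2pdf` are installed and on the PATH. The class `HocrPipeline` and its method names are hypothetical. It relies on `cuneiform -f hocr -o <file>` to write hOCR and on `hocr2pdf -i <image> -o <pdf>` reading the hOCR from standard input. Depending on how Cuneiform was built, it may not read TIFF directly, in which case the image would need converting first.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: image -> hOCR (Cuneiform) -> searchable PDF (hocr2pdf).
class HocrPipeline {

    // Step 1 command: cuneiform -f hocr -o <hocrOut> <image>
    static List<String> cuneiformCommand(Path image, Path hocrOut) {
        return Arrays.asList("cuneiform", "-f", "hocr", "-o", hocrOut.toString(), image.toString());
    }

    // Step 2 command: hocr2pdf -i <image> -o <pdfOut>   (hOCR is fed on stdin)
    static List<String> hocr2pdfCommand(Path image, Path pdfOut) {
        return Arrays.asList("hocr2pdf", "-i", image.toString(), "-o", pdfOut.toString());
    }

    // Runs one command, optionally feeding a file to its standard input,
    // and fails loudly on a non-zero exit code.
    static void run(List<String> command, File stdin) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectOutput(ProcessBuilder.Redirect.INHERIT);
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        if (stdin != null) {
            builder.redirectInput(stdin);
        }
        int exit = builder.start().waitFor();
        if (exit != 0) {
            throw new IOException(command.get(0) + " exited with code " + exit);
        }
    }

    // Full pipeline: OCR the image to hOCR, then overlay the text on the image in a PDF.
    static void convert(Path image, Path hocr, Path pdf) throws IOException, InterruptedException {
        run(cuneiformCommand(image, hocr), null);
        run(hocr2pdfCommand(image, pdf), hocr.toFile());
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: HocrPipeline <image> <out.hocr> <out.pdf>");
            return;
        }
        convert(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
    }
}
```

hOCR itself is just HTML in which each recognized word or line carries its bounding box on the page, which is what lets hocr2pdf place the invisible text underneath the matching part of the image.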

Alasdair
  • Hi, thank you for your input. Can you provide more information on `Cuneiform` and the `hocr` format? I can't seem to find much information on it. Thank you very much. – Thang Pham Feb 03 '12 at 14:44