17

I'm looking for an open source OCR library that runs on Linux. It needs to work for PNGs and PDFs. Ideally I would like to call this library from Java or Ruby. Is anything like this available?

Regards.

Chris
  • 1
    You have checked that the text isn't already available in the PDF, right? (I vaguely recall that PNG might also have the capability to store text, but I could be mistaken there). – Andrew Grimm May 15 '11 at 23:37
  • http://www.roncemer.com/software-development/java-ocr – Trick Aug 28 '12 at 08:49

3 Answers

13

Tesseract is a very good OCR engine: https://github.com/tesseract-ocr/tesseract

The project was started at HP Labs and is now continued and sponsored by Google (for Google Books!). It is released under the Apache license, and it runs on Linux. It takes TIFF or PNG files as input; for PDFs, you will need to convert to one of these formats first. As far as I know there is no language binding, so you would invoke it as a subprogram.
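To illustrate the subprogram approach, here is a minimal Java sketch, assuming Java 11 or later and a `tesseract` binary on the `PATH`. It relies on the usual command-line form `tesseract <image> <outputbase> -l <lang>`, where the recognized text ends up in `<outputbase>.txt`. The class and method names (`TesseractCli`, `buildCommand`, `recognize`) are made up for this example and are not part of Tesseract.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper class, not part of Tesseract.
class TesseractCli {

    // Builds the argument list: tesseract <image> <outputBase> -l <language>
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "-l", language);
    }

    // Runs tesseract on one image and returns the recognized text.
    static String recognize(Path image, String language)
            throws IOException, InterruptedException {
        // Tesseract appends ".txt" to the output base name it is given.
        Path outputBase = Files.createTempFile("ocr-", "");
        Path textFile = Path.of(outputBase + ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // ignore console chatter
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return Files.readString(textFile, StandardCharsets.UTF_8);
        } finally {
            // Clean up both the placeholder file and the generated text file.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TesseractCli.java <image.png>");
            return;
        }
        System.out.println(recognize(Path.of(args[0]), "eng"));
    }
}
```

The `"eng"` language code is only an assumption for the example; it has to match a language pack that is actually installed.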

olivierlemasle
1

Cuneiform is free and does a decent job. You could invoke it as a subprogram but there's no language binding that I know of. It won't read PDFs directly but you can easily take apart PDFs that are sequences of scanned images to feed them to Cuneiform. There are also scripts to reassemble the images and text back into a searchable PDF.

Ben Jackson
0

Try tesjeract, which uses JNI to call the Tesseract OCR API.

For PDFs, you'll need to convert them to images first, using Ghostscript, for instance.
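The PDF-to-image step could look roughly like this from Java. This is a sketch under a few assumptions: Java 11 or later, the Ghostscript executable installed as `gs`, and 300 dpi 24-bit PNG output (the `png16m` device) being acceptable for OCR. `PdfToPng` and its methods are illustrative names only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper class, not part of Ghostscript.
class PdfToPng {

    // Builds a Ghostscript call that writes one PNG per page:
    // page-001.png, page-002.png, ... inside outputDir.
    static List<String> buildCommand(Path pdf, Path outputDir, int dpi) {
        String pattern = outputDir.resolve("page-%03d.png").toString();
        return List.of(
                "gs",
                "-dBATCH", "-dNOPAUSE",  // run non-interactively and exit when done
                "-dSAFER",               // restrict file access from the PDF
                "-sDEVICE=png16m",       // 24-bit colour PNG output
                "-r" + dpi,              // rendering resolution in dpi
                "-sOutputFile=" + pattern,
                pdf.toString());
    }

    // Renders every page of the PDF to a PNG file in outputDir.
    static void convert(Path pdf, Path outputDir, int dpi)
            throws IOException, InterruptedException {
        Files.createDirectories(outputDir);
        Process process = new ProcessBuilder(buildCommand(pdf, outputDir, dpi))
                .inheritIO() // show Ghostscript's own messages on the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("gs exited with code " + exitCode);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java PdfToPng.java <input.pdf> <output-dir>");
            return;
        }
        convert(Path.of(args[0]), Path.of(args[1]), 300);
    }
}
```

Each resulting page image can then be handed to the OCR engine one at a time.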

nguyenq